Abstract

Methods of transfer learning try to combine knowledge from several related tasks (or domains) to 
improve performance on a test task. Inspired by causal methodology, we relax the usual covariate shift 
assumption and assume that it holds true for a subset of predictor variables: the conditional distribution 
of the target variable given this subset of predictors is invariant over all tasks. We show how this 
assumption can be motivated from ideas in the field of causality. We prove that in an adversarial setting 
using this subset for prediction is optimal if no examples from the test task are observed; we further 
provide examples, in which the tasks are sufficiently diverse and the estimator therefore outperforms 
pooling the data, even on average. If examples from the test task are available, we provide a method to 
transfer knowledge from the training tasks and exploit all available features for prediction. We introduce 
a practical method which allows for automatic inference of the above subset and provide corresponding 
code. We present results on synthetic data sets and a gene deletion data set. 


1 Introduction 


Standard approaches to supervised learning assume that training and test data can be modeled as an i.i.d.
sample from a distribution $P := P^{(X,Y)}$. The inputs $X$ are often vectorial, and the outputs $Y$ may be
labels (classification) or continuous values (regression). The i.i.d. setting is theoretically well understood
and yields remarkable predictive accuracy in problems such as image classification, speech recognition and
machine translation [e.g. Schmidhuber, 2015, Krizhevsky et al., 2012]. However, many real world problems
do not fit into this setting. Distributions may change between training and testing, and work in the field of
transfer learning attempts to address this. We begin by describing the problems of domain generalization
and multi-task learning, followed by a discussion of existing assumptions made to address the problem of
knowledge transfer, as well as the new assumption we assay in this paper.


1.1 Domain generalization and multi-task learning 

Assume that we want to predict a target $Y \in \mathbb{R}$ from some predictor variable $X \in \mathbb{R}^p$.
Consider $D$ training (or source) tasks$^1$ $P^1, \ldots, P^D$, where each $P^k$ represents a probability
distribution generating data $(X^k, Y^k) \sim P^k$. At training time, we observe a sample
$(X_i^k, Y_i^k)_{i=1}^{n_k}$ for each source task $k \in \{1, \ldots, D\}$; at test time, we want to predict the
target values of an unlabeled sample from the task $T$ of interest. We wish to learn a map
$f : \mathbb{R}^p \to \mathbb{R}$ with small expected squared loss
$\mathcal{E}_{P^T}(f) = \mathbb{E}_{(X^T, Y^T) \sim P^T}\,(Y^T - f(X^T))^2$ on the test task $T$.

$^1$In this work, we use the expressions “task” and “domain” interchangeably.
method                                   training data from                                          test domain

Domain Generalization (DG)               $(x^1, y^1), \ldots, (x^D, y^D)$                            $T := D + 1$
                                         $(x^1, y^1), \ldots, (x^D, y^D), x^{D+1}$

Asymmetric Multi-Task Learning (AMTL)    $(x^1, y^1), \ldots, (x^D, y^D)$                            $T := D$
                                         $(x^1, y^1), \ldots, (x^D, y^D), x^D$

Symmetric Multi-Task Learning (SMTL)     $(x^1, y^1), \ldots, (x^D, y^D)$                            all
                                         $(x^1, y^1), \ldots, (x^D, y^D), x^1, \ldots, x^D$

Table 1: Taxonomy for domain generalization (DG) and multi-task learning (AMTL and SMTL). Each
problem can either be used without (first line) or with (second line) additional unlabelled data.


In domain generalization (DG) [e.g. Muandet et al., 2013], we have $T = D + 1$, that is, we are interested
in using information from the source tasks in order to predict $Y^{D+1}$ from $X^{D+1}$ in a related yet
unobserved test task $P^{D+1}$. To beat simple baseline techniques, regularity conditions on the differences of
the tasks are required. Indeed, if the test task differs significantly from the source tasks, we may run into
the problem of negative transfer [Pan and Yang, 2010] and DG becomes impossible [Ben-David et al., 2010].
If examples from the test task are available [e.g. Pan and Yang, 2010, Baxter, 2000], we refer to the
problem as asymmetric multi-task learning (AMTL). If the objective is to improve performance in all the
training tasks [e.g. Caruana, 1997], we call the problem symmetric multi-task learning (SMTL), see Table 1
for a summary of these settings. In MTL (this includes both AMTL and SMTL), if infinitely many labeled data
are available from the test task, it is impossible to beat a method that learns on the test task and ignores
the training tasks.


1.2 Prior work 


A first family of methods assumes that covariate shift holds [e.g. Quiñonero-Candela et al., 2009,
Schweikert et al., 2009]. This states that for all $k \in \{1, \ldots, D, T\}$, the conditionals
$Y^k \mid X^k$ are invariant between tasks. Therefore, the differences in the joint distribution originate
from a difference in the marginal distribution of $X^k$. For instance, if an unlabeled sample from the test
task is available at training in the DG setting, the training sample can be re-weighted via importance
sampling [Gretton et al., 2009, Shimodaira, 2000, Sugiyama et al., 2008] so that it becomes representative
of the test task.


Another line of work focuses on sharing parameters between tasks. This idea originates in the
hierarchical Bayesian literature [Bonilla et al., 2007, Gao et al., 2008]. For instance, Lawrence and
Platt [2004] introduce a model for MTL in which the mapping $f_k$ in each task $k \in \{1, \ldots, D, T\}$ is
drawn independently from a common Gaussian Process (GP), and the likelihood of the latent functions depends
on a shared parameter $\theta$. A similar approach is introduced by Evgeniou and Pontil [2004]: they consider
an SVM with weight vector $w^k = w_0 + v^k$, where $w_0$ is shared across tasks and $v^k$ is task specific.
This allows for tasks to be similar (in which case $v^k$ does not have a significant contribution to
predictions) or quite different. Daumé III et al. [2010] use a related approach for MTL.


An alternative family of methods is based on learning a set of common features for all tasks
[Argyriou et al., 2007a, Romera-Paredes et al., 2012, Argyriou et al., 2007b, Raina et al., 2007]. For
instance, Argyriou et al. [2007a,b] propose to learn a set of low dimensional features shared between tasks
using regularization, and then learn all tasks independently using these features. In Raina et al. [2007],
the authors construct a similar set of features using regularization but make use of only unlabeled
examples. Chen et al. [2012] propose to build shared feature mappings which are robust to noise by using
autoencoders.

Finally, the assumption introduced in this paper is based on a causal view on domain adaptation and
transfer. Schölkopf et al. [2012] relate multi-task learning with the independence between cause and
mechanism, but do not propose a concrete algorithm. This notion is closely related to exogeneity
[Zhang et al., 2015b], which roughly states that a causal mechanism mapping a cause $X$ to $Y$ should not
depend on the distribution of $X$. Additionally, Zhang et al. [2013] consider the problem of target and
conditional shift when the target variable is causal for the features. They assume that there exists a
linear mapping between the covariates in different tasks, and the parameters of this mapping only depend on
the distribution of the target variable. Moreover, Zhang et al. [2015a] argue that the availability of
multiple domains is sufficient to drop this previous assumption when the distribution of $Y^k$ and the
conditional $X^k \mid Y^k$ change independently. The conditional in the test task can then be written as a
linear mixture of the conditionals in the source domains. The concept of invariant conditionals and
exogeneity can also be used for causal discovery [Peters et al., 2015, Zhang et al., 2015b].


1.3 New contribution 

Taking into account causal knowledge, our approach to DG and MTL assumes that covariate shift holds
only for a subset of the features. From the point of view of causal modeling [Pearl, 2009], assuming
invariance of conditionals makes sense if the conditionals represent causal mechanisms [e.g. Hoover, 1990],
see Section 2.3 for details. Intuitively, we expect that a causal mechanism is a property of the physical
world, and it does not depend on what we feed into it. If the input (which in this case coincides with the
covariates) shifts, the mechanism should thus remain invariant [Hoover, 1990, Janzing and Schölkopf, 2010,
Peters et al., 2015]. In the anti-causal direction, however, a shift of the input usually leads to a
changing conditional [Schölkopf et al., 2012]. In practice, prediction problems are often not causal: we
should allow for the possibility that the set of predictors contains variables that are causal, anticausal,
or confounded, i.e., statistically dependent variables without a directed causal link with the target
variable. We thus expect that there is a subset $S^*$ of predictors, referred to as an invariant set, for
which the covariate shift assumption holds true, i.e., the conditionals of output given predictor
$Y^k \mid X^k_{S^*}$ are invariant across $k \in \{1, \ldots, D, T\}$. If $S^*$ is a strict subset of all
predictors, this relaxes full covariate shift. We prove that in this case, knowing $S^*$ leads to robust
properties for DG. Once an invariant set is known, traditional methods for covariate shift can be applied as
a black box, see Figure 1. Finally, note that in this work, we concentrate on the linear setting, keeping in
mind that this has specific implications for covariate shift.


1.4 Organization of the paper 

Section 2 formally describes our approach and its underlying assumptions; in particular, we assume that
an invariant set $S^*$ is known. For DG, we prove in Section 2.1 that predicting using only predictors in
$S^*$ is optimal in an adversarial setting. In MTL, when additional labeled examples from $T$ are available,
one might want to use all available features. Section 2.2 provides a method to address this. We discuss a
link to causal inference in Section 2.3. Often, an invariant set $S^*$ is not known a priori. Section 3
presents a method for inferring an invariant set $S^*$ from data. We then present experiments on simulated
and real data.


2 Exploiting invariant conditional distributions in transfer learning

Consider a transfer learning regression problem with source tasks $P^1, \ldots, P^D$, where
$(X^k, Y^k) \sim P^k$ for $k \in \{1, \ldots, D\}$.$^2$ We now formulate our main assumptions.

(A1) There exists a subset $S^* \subseteq \{1, \ldots, p\}$ of predictor variables such that

$$Y^k \mid X^k_{S^*} \overset{d}{=} Y^{k'} \mid X^{k'}_{S^*}, \quad \forall k, k' \in \{1, \ldots, D\}. \qquad (1)$$

We say that $S^*$ is an invariant set which leads to invariant conditionals. Here, $\overset{d}{=}$ denotes
equality in distribution.

$^2$We assume throughout this work the existence of densities and that random variables have finite variance.
[Figure 1: diagram with the following labels. (A1): $\exists S^* \subseteq \{1, \ldots, p\}$ such that
$Y \mid X_{S^*}$ is invariant. Covariate shift holds: $Y \mid X_{\{1, \ldots, p\}}$ is invariant. Use methods for
covariate shift, applied to $S^*$. Here, (A2): linear model.]

Figure 1: Assumption (A1) (blue) is a relaxation of covariate shift (orange): the covariate shift assumption
is a special case of (A1) with $S^* = \{1, \ldots, p\}$. Given the invariant set $S^*$, methods for covariate
shift can be applied.


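(A1') This invariance also holds in the test task $T$, i.e., (1) holds for all $k, k' \in \{1, \ldots, D, T\}$.

(A2) The conditional distribution of $Y$ given an invariant set $S^*$ is linear: there exist
$\alpha \in \mathbb{R}^{|S^*|}$ and a random variable $\varepsilon$ such that for all $k \in \{1, \ldots, D\}$,
$[Y^k \mid X^k_{S^*} = x] \overset{d}{=} \alpha^t x + \varepsilon^k$, that is
$Y^k = \alpha^t X^k_{S^*} + \varepsilon^k$, with $\varepsilon^k$ independent of $X^k_{S^*}$ and, for all
$k \in \{1, \ldots, D\}$, $\varepsilon^k \overset{d}{=} \varepsilon$.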


Assumption (A1') is stronger than (A1) only in the DG setting, where, of course, (A1') and (A2) imply the
linearity also in the test task $T$. While Assumption (A1) is testable from training data, see Section 3,
(A1') is not. In covariate shift, one usually assumes that (A1') holds for the set of all features.
Therefore, (A1') is a weaker condition than covariate shift, see Figure 1. We regard this assumption as a
building block that can be combined with any method for covariate shift, applied to the subset $S^*$. It is
known that it can be arbitrarily hard to exploit the assumption of covariate shift in practice
[Ben-David et al., 2010]. In a general setting, for instance, assumptions about the support of the training
distributions $P^1, \ldots, P^D$ and the test distribution $P^T$ must be made for methods such as
re-weighting to be expected to work [e.g. Gretton et al., 2009]. The aim of our work is not to solve the
full covariate shift problem, but to elucidate a relaxation of covariate shift in which it holds given only
a subset of the features. We concentrate on linear relations (A2), which circumvents the issue of
overlapping supports, for example.

For the remainder of this section, we assume that we are given an invariant subset $S^*$ that satisfies (A1)
and (A2). Note that we will also require (A1') for DG.

We show how the knowledge of $S^*$ can be exploited for the DG problem (Section 2.1) and in the MTL
case (Section 2.2). Here and below, we focus on linear regression using squared loss

$$\mathcal{E}_{P^T}(\beta) = \mathbb{E}_{(X^T, Y^T) \sim P^T}\,(Y^T - \beta^t X^T)^2 \qquad (2)$$

(the superscript $T$ corresponds to the test task, not to be confused with the transpose, indicated by
superscript $t$). We denote by $\mathcal{E}_{P^1, \ldots, P^D}(\beta)$ the squared error averaged over the
training tasks $k \in \{1, \ldots, D\}$.


2.1 Domain generalization (DG): no labels from the test task 

We first study the DG setting in which we receive no labeled examples from the test task during training
time. Throughout this subsection, we assume that, in addition to (A1) and (A2), Assumption (A1') holds.
It is important to appreciate that (A1') is a strong assumption that is not testable on the training data: it
is an assumption about the test task. We believe no nontrivial statement about DG is possible without an
assumption of this type.

Now, we introduce our proposed estimator, which uses the conditional mean of the target variable given 
the invariant set in the training tasks. We prove that this estimator is optimal in an adversarial setting. 

Proposed estimator. The optimal predictor obtained by minimizing (2) is the conditional mean

$$\beta^{opt} := \operatorname{argmin}_{\beta \in \mathbb{R}^p} \mathcal{E}_{P^T}(\beta), \qquad (3)$$

which is not available during training time. Given an invariant set $S^*$ satisfying (A1), (A1') and (A2), we
propose to use the corresponding conditional expectation as a predictor, that is

$$\beta^{CS(S^*)} : \mathbb{R}^p \to \mathbb{R}, \quad x \mapsto \mathbb{E}[Y^1 \mid X^1_{S^*} = x_{S^*}], \qquad (4)$$

and write $\beta^{CS(S^*)\,t} x := \mathbb{E}[Y^1 \mid X^1_{S^*} = x_{S^*}]$.

The components of $\beta^{CS(S^*)}$ outside $S^*$ are defined to be zero. Because of (A1), the conditional
expectation in (4) is the same in all training tasks. In the limit of infinitely many data, given a subset
$S$, $\beta^{CS(S)}$ is obtained by pooling the training tasks and regressing using only features in $S$. In
particular, $\beta^{CS} := \beta^{CS(\{1, \ldots, p\})}$ is the estimator obtained when assuming traditional
covariate shift.


Optimality in an adversarial setting. In an adversarial setting, predictor (4) satisfies the following
optimality condition; as for the other results, the proof is provided in Appendix A. We state and prove a
more general, nonlinear version of Theorem 1 in Appendix A.1.

Theorem 1 Consider $(X^1, Y^1) \sim P^1, \ldots, (X^D, Y^D) \sim P^D$ and an invariant set $S^*$ satisfying
(A1) and (A2). The proposed estimator satisfies an optimality statement over the set of distributions such
that (A1') holds: we have

$$\beta^{CS(S^*)} \in \operatorname{argmin}_{\beta \in \mathbb{R}^p} \ \sup_{P^T \in \mathcal{P}} \mathcal{E}_{P^T}(\beta),$$

where $\beta^{CS(S^*)}$ is defined in (4) and $\mathcal{P}$ contains all distributions over $(X^T, Y^T)$,
$T = D + 1$, that are absolutely continuous with respect to the same product measure $\mu$, and satisfy
$Y^T \mid X^T_{S^*} \overset{d}{=} Y^1 \mid X^1_{S^*}$.

Unlike the optimal predictor $\beta^{opt}$, the proposed estimator (4) can be learned from the data available
in the training tasks. Given a sample $(X^k_1, Y^k_1), \ldots, (X^k_{n_k}, Y^k_{n_k})$ from tasks
$k \in \{1, \ldots, D\}$, we can estimate the conditional mean in (4) by regressing $Y^k$ on $X^k_{S^*}$. Due
to (A1), we may also pool the data over the different tasks and use
$(X^1_1, Y^1_1), \ldots, (X^1_{n_1}, Y^1_{n_1}), (X^2_1, Y^2_1), \ldots, (X^D_{n_D}, Y^D_{n_D})$ as a training
sample for this regression.
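
The pooled regression described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the invariant set is taken as known, the data are centered (no intercept), and the function name `fit_invariant_predictor`, the toy generator `make_task` and all numerical values are hypothetical and not taken from the authors' code.

```python
import numpy as np

def fit_invariant_predictor(tasks, subset):
    """Pool the training tasks and regress Y on the features in `subset` only.

    tasks  : list of (X, y) pairs, X of shape (n_k, p), y of shape (n_k,)
    subset : indices of the (assumed known) invariant set S*
    Returns a length-p coefficient vector that is zero outside `subset`.
    """
    X = np.vstack([X_k for X_k, _ in tasks])
    y = np.concatenate([y_k for _, y_k in tasks])
    subset = list(subset)
    coef, *_ = np.linalg.lstsq(X[:, subset], y, rcond=None)
    beta = np.zeros(X.shape[1])
    beta[subset] = coef          # components outside S* stay at zero
    return beta

# Hypothetical toy data: features 0 and 1 are invariant predictors of Y,
# feature 2 depends on Y through a task-specific coefficient gamma.
rng = np.random.default_rng(0)
alpha = np.array([1.0, -2.0])

def make_task(gamma, n=2000):
    x_s = rng.normal(size=(n, 2))
    y = x_s @ alpha + 0.5 * rng.normal(size=n)
    z = gamma * y + rng.normal(size=n)
    return np.column_stack([x_s, z]), y

tasks = [make_task(0.5), make_task(-1.5)]
beta_inv = fit_invariant_predictor(tasks, subset=[0, 1])
print(beta_inv)
```

Passing all indices as `subset` gives the plain pooled least squares fit, which is the baseline the text compares against.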

One may also compare the proposed estimator with pooling the training tasks, a standard baseline in
transfer learning. Focusing on a specific example, Proposition 1 in the following paragraph shows that when
the test tasks become diverse, predicting using (4) outperforms pooling on average over all tasks.


Comparison against pooling the data. We proved that the proposed estimator (4) does well in an
adversarial setting, in the sense that it minimizes the largest error on a task in $\mathcal{P}$. The
following result provides an example in which we can analytically compare the proposed estimator with the
estimator obtained from pooling the training data, which is a benchmark in transfer learning. We prove that
in this setting, the proposed estimator outperforms pooling the data on average over test tasks when the
tasks become more diverse. Let $X^k_{S^*}$ be a vector of independent Gaussian variables in task $k$. Let the
target $Y^k$ satisfy

$$Y^k = \alpha^t X^k_{S^*} + \varepsilon^k, \qquad (5)$$

where for each $k \in \{1, \ldots, D\}$, $\varepsilon^k$ is Gaussian and independent of $X^k_{S^*}$. We have
$X^k = (X^k_{S^*}, Z^k)$, where

$$Z^k = \gamma^k Y^k + \eta^k,$$

for some $\gamma^k \in \mathbb{R}$ and where $\eta^k$ is Gaussian and independent of $Y^k$.$^3$ Moreover,
assume that the training tasks are balanced.

We compare properties of estimator $\beta^{CS(S^*)}$ defined in Equation (4) against the least squares
estimator $\beta^{CS}$ obtained from pooling the training data. In this setting, the tasks differ in
coefficients $\gamma^k$, which are randomly sampled. We prove that the squared loss averaged over unseen test
tasks is always larger for the pooled approach, when coefficients $\gamma^k$ are centered around zero. In the
case where they are centered around a non-zero mean, we prove that when the variance between tasks (in this
case, for coefficients $\gamma^k$) becomes large enough, the invariant approach also outperforms pooling the
data.


$^3$Using the notation introduced later in Section 2.3, this corresponds to a Gaussian SEM with the DAG
shown in the accompanying figure.
Figure 2: The figure shows expected errors for the pooled approach and the proposed method, see
Proposition 1. $\mu = 0$. We consider two training tasks over 10,000 simulations. In each, we randomly sample
the variance of each covariate in $X$, the variance of $\eta$, and $\gamma$. $\alpha$ is the same in all
tasks. As predicted by Proposition 1, we observe that the error from the pooled approach (red) is
systematically higher than the error from the prediction using only the invariant subset (blue), and both
the error and its variance become large as the variance of coefficients $\gamma^k$ increases.


Proposition 1 Consider the model described previously. Moreover, assume that the tasks differ as follows:
the coefficients $\gamma^1, \ldots, \gamma^D, \gamma^T = \gamma^{D+1}$ are i.i.d. with mean zero and variance
$\Sigma^2 > 0$. Then the least squares predictor $\beta^{CS}$ obtained from pooling the $D$ training tasks
satisfies:

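$$\mathbb{E}_{\gamma^T}\big(\mathcal{E}_{P^T}(\beta^{CS})\big) \geq \mathbb{E}_{\gamma^T}\big(\mathcal{E}_{P^T}(\beta^{CS(S^*)})\big). \qquad (6)$$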

In particular, this implies the following: 

{SpT {/3^^)) > ( 7 ) 

Moreover, if the mean $\mu$ is non-zero, the inequality holds for fixed $\gamma^1, \ldots, \gamma^D$ if
$\Sigma^2 > P(\mu)$, where $P$ is a polynomial in $\mu$, see Appendix A.1 for details.

The proof of Proposition 1 can be found in Appendix A.2. Figure 2 visualizes Proposition 1 for two training
tasks: it shows the expected errors for the pooled and invariant approaches as the variance $\Sigma^2$
increases. Recall that $\Sigma^2$ corresponds to the variance of coefficients $\gamma^k$, and thus indicates
how different the tasks are. The expected errors are computed using the analytic expression found in the
proof of Proposition 1. As predicted by Proposition 1, the expected error of the pooled approach always
exceeds the one of the proposed method (the coefficients $\gamma^k$ are centered around zero). As $\Sigma^2$
tends to zero, $\gamma^k$ is close to zero in all tasks, which explains the equality of both the pooled and
invariant errors for the limit case $\Sigma^2$ approaching 0. For coefficients $\gamma^k$ centered around a
non-zero value, the inequality of Proposition 1 does not necessarily hold for small $\Sigma^2$.

Proposition 1 presents a setting in which the invariant approach outperforms pooling the data when the
test errors are averaged over $\gamma^T$, i.e.
$\mathbb{E}_{\gamma^T}(\mathcal{E}_{P^T}(\beta^{CS})) \geq \mathbb{E}_{\gamma^T}(\mathcal{E}_{P^T}(\beta^{CS(S^*)}))$.
It is also clear that the equality of the distribution of $\varepsilon^k$ in Equation (5) for all
$k \in \{1, \ldots, D\}$ leads to $\mathrm{Var}_{\gamma^T}(\mathcal{E}_{P^T}(\beta^{CS(S^*)})) = 0$,
thus our invariant estimator minimizes the variance of the test errors across all related tasks.
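
As a rough numerical illustration of this comparison, the sketch below evaluates population errors in closed form for the Gaussian model with $Z^k = \gamma^k Y^k + \eta^k$. It rests on simplifying assumptions that are ours and not necessarily those of the proof in the appendix: all variables are centered, the covariates in $X_{S^*}$ have unit variance, and the training tasks are balanced. The function names and parameter values are hypothetical.

```python
import numpy as np

def pooled_beta(alpha, gammas, var_eps, var_eta):
    """Population least squares of Y on (X_S*, Z) for balanced, pooled training tasks."""
    alpha = np.asarray(alpha, dtype=float)
    s = alpha.size
    var_y = alpha @ alpha + var_eps                  # Var(Y) for unit-variance X_S*
    g1 = np.mean(gammas)                             # average gamma^k over tasks
    g2 = np.mean(np.square(gammas))                  # average (gamma^k)^2 over tasks
    cov_xx = np.eye(s + 1)
    cov_xx[:s, s] = g1 * alpha                       # Cov(X_S*, Z)
    cov_xx[s, :s] = g1 * alpha
    cov_xx[s, s] = g2 * var_y + var_eta              # Var(Z)
    cov_xy = np.append(alpha, g1 * var_y)            # Cov((X_S*, Z), Y)
    return np.linalg.solve(cov_xx, cov_xy)

def mean_test_error(beta, alpha, sigma2, var_eps, var_eta):
    """Test squared error averaged over gamma^T with mean zero and variance sigma2."""
    alpha = np.asarray(alpha, dtype=float)
    a, b = beta[:-1], beta[-1]
    var_y = alpha @ alpha + var_eps
    return np.sum((alpha - a) ** 2) + var_eps + b ** 2 * (sigma2 * var_y + var_eta)

rng = np.random.default_rng(0)
alpha = np.array([1.0, -0.5])
var_eps, var_eta = 1.0, 1.0
invariant = np.append(alpha, 0.0)                    # beta^{CS(S*)}: zero weight on Z

results = {}
for sigma2 in (0.1, 1.0, 4.0):
    pooled_errors = []
    for _ in range(1000):                            # random draws of two training tasks
        gammas = rng.normal(scale=np.sqrt(sigma2), size=2)
        beta = pooled_beta(alpha, gammas, var_eps, var_eta)
        pooled_errors.append(mean_test_error(beta, alpha, sigma2, var_eps, var_eta))
    inv_error = mean_test_error(invariant, alpha, sigma2, var_eps, var_eta)
    results[sigma2] = (np.array(pooled_errors), inv_error)
    print(sigma2, np.mean(pooled_errors), inv_error)
```

In this sketch the averaged error of the invariant predictor equals the noise variance for every value of `sigma2`, while the pooled error depends on the sampled training coefficients.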
2.2 Multi-task learning (MTL): combining invariance and task-specific information

In MTL, a labeled sample $(X^T_i, Y^T_i)_{i=1}^{n_T}$ is available from the test task and the goal is to
transfer knowledge from the training tasks. As before, we are given an invariant set $S^*$ satisfying (A1)
and (A2). Can we combine the invariance assumption with the new labeled sample and perform better than a
method that trains only on the data in the test task? According to (A1) and (A2), the target satisfies
$Y^k = \alpha^t X^k_{S^*} + \varepsilon^k$, where the noise $\varepsilon^k$ has zero mean and finite variance,
is independent of $X^k_{S^*}$, and has the same distribution in the different tasks
$k \in \{1, \ldots, D, T\}$. Our objective is to use the knowledge gained from the training tasks to get a
better estimate of $\beta^{opt}$ defined in Equation (3). We describe below a way to tackle this using
missing data methods.

Missing data approach. In this section, we specify how we propose to tackle MTL by framing it as a
missing data problem. While the idea is presented in the context of AMTL, it can be used for SMTL in the
same way. In order to motivate the method, assume that for each $k \in \{1, \ldots, D, T\}$, there exists
another probability distribution $Q^k$ with density $q^k$ having the following properties: (i) when
restricted to $(X^k_{S^*}, Y^k)$, $Q^k$ coincides with $P^k$ and (ii) the conditional mapping
$q(y \mid x_{S^*}, x_N)$ is the same in all tasks. Property (ii) implies that if $(X^k_{S^*}, X^k_N, Y^k)$
drawn from $Q^k$ was available for each task, we could learn a joint regression model of all the training
tasks and the test task to predict $Y$ given $X_{S^*}$ and $X_N$. The following proposition shows the
existence of such distributions.

Proposition 2 Let $S^*$ be an invariant set verifying (A1) and (A2). For $k \in \{1, \ldots, D, T\}$, denote
by $(x, y) \mapsto p^k(x, y)$ the density of $P^k$. Then there exists a function
$q : \mathbb{R}^{p+1} \to \mathbb{R}$ such that for each $k \in \{1, \ldots, D, T\}$, there exists a
distribution $Q^k$ with density $q^k$ such that for all $(x, y) \in \mathbb{R}^{p+1}$:

i) $q^k(x_{S^*}, y) = p^k(x_{S^*}, y)$,
ii) $q^k(y \mid x_{S^*}, x_N) = q(y \mid x_{S^*}, x_N)$.

The proof for Proposition 2 can be found in Appendix A.3. Following the previous intuition, for the training
tasks $k \in \{1, \ldots, D\}$, we hide the data of $X^k_N$ and pretend the data in each task
$k \in \{1, \ldots, D, T\}$ come from $Q^k$. Note that some of the data are only missing for the training
tasks. More precisely, $X^k_N$ is missing for $k \in \{1, \ldots, D\}$, while because of i) in
Proposition 2, $(x^k_{S^*}, y^k)$ is available for all tasks $k \in \{1, \ldots, D, T\}$. We thus pool the
data and learn a regression model of $Y$ versus $(X_{S^*}, X_N)$ by maximizing the likelihood of the
observed data.

We formalize the problem as follows. Let $(z_i)_{i=1}^n = (x_{S^*,i}, x_{N,i}, y_i)_{i=1}^n$ be a pooled
sample of the available data from the training tasks and the test task, in which $x_{N,i}$ is considered
missing if $x_i$ is drawn from one of the training tasks. Here, $n = \sum_k n_k$ is the total number of
training and test examples. Denote by $z_{obs,i}$ the components of $z_i$ which are not missing. In
particular, $z_{obs,i} = z_i$ if $i$ is drawn from the test task and $z_{obs,i} = (x_{S^*,i}, y_i)$
otherwise. Moreover, let $\Sigma$ be a $(p+1) \times (p+1)$ positive definite matrix, and let $\Sigma_i$ be
the submatrix of $\Sigma$ which corresponds to the observed features for example $i$. If example $i$ is drawn
from a training task, $\Sigma_i$ is of size $(|S^*| + 1) \times (|S^*| + 1)$, and $(p+1) \times (p+1)$
otherwise. The log-likelihood based on the observed data for matrix $\Sigma$ satisfies:

$$\ell(\Sigma) = \mathrm{const} - \frac{1}{2} \sum_{i=1}^n \log \det(\Sigma_i) - \frac{1}{2} \sum_{i=1}^n z_{obs,i}^t\, \Sigma_i^{-1}\, z_{obs,i}, \qquad (8)$$

and our goal is to find $\Sigma$ which maximizes (8).
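
The observed-data log-likelihood can be evaluated directly, which is useful for monitoring any optimizer. The sketch below assumes centered Gaussian data and drops the additive constant; the function name `observed_loglik` and its argument layout (a boolean flag per example and the column indices of $(X_{S^*}, Y)$) are hypothetical choices made for illustration.

```python
import numpy as np

def observed_loglik(Sigma, Z, is_test, obs_idx):
    """Observed-data Gaussian log-likelihood of a covariance matrix, up to a constant.

    Sigma   : (p+1, p+1) positive definite covariance of (X, Y)
    Z       : (n, p+1) data matrix; entries outside obs_idx are ignored for training rows
    is_test : length-n booleans, True if the example comes from the test task
    obs_idx : column indices of (X_S*, Y), the part observed in every task
    """
    all_idx = np.arange(Sigma.shape[0])
    obs_idx = np.asarray(obs_idx)
    total = 0.0
    for z, test in zip(np.asarray(Z, dtype=float), is_test):
        idx = all_idx if test else obs_idx
        Sigma_i = Sigma[np.ix_(idx, idx)]            # covariance of the observed part
        z_obs = z[idx]
        _, logdet = np.linalg.slogdet(Sigma_i)
        total += -0.5 * logdet - 0.5 * z_obs @ np.linalg.solve(Sigma_i, z_obs)
    return total

# Tiny example with columns (X_S*, X_N, Y); X_N is missing in the training row.
Z_demo = np.array([[1.0, 2.0, 3.0],
                   [1.0, np.nan, 2.0]])
value = observed_loglik(np.eye(3), Z_demo, is_test=[True, False], obs_idx=[0, 2])
print(value)
```

Missing entries may be stored as NaN, since training rows are only ever indexed through `obs_idx`.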

When all data are observed, the least squares estimator can be seen as the result of a two step
procedure. First, (8) is maximized by the sample covariance matrix. Then, one computes the conditional
mean $\mathbb{E}[Y \mid X = x]$ of the estimated joint distribution of $(X, Y)$. In the case of missing data,
however, the sample covariance matrix no longer maximizes (8), see paragraph ‘A naive estimator for
comparison’ below. Instead, we maximize (8) using EM.
Chapter 11 in Little and Rubin [1986] provides the update equations for optimizing Equation (8) using
EM. More precisely, given an estimate $\Sigma^r$ of the covariance matrix at step $r$, the algorithm goes as
follows.

E step: For each example $i$, we define

$$z_i^r = \begin{cases} z_i & \text{if example } i \text{ is from the test task,} \\ \big(x_{S^*,i},\ \mathbb{E}(X_N \mid z_{obs,i}, \Sigma^r),\ y_i\big) & \text{otherwise.} \end{cases}$$

Here, we are essentially imputing the data for $X_N$ in the training tasks by the conditional mean given
the observed data, using the current estimate of the covariance matrix $\Sigma^r$. The conditional
expectation is computed using the current estimate $\Sigma^r$ and the Gaussian conditioning formula:

$$\mathbb{E}(X_N \mid z_{obs,i}, \Sigma^r) = \Sigma^r_{N, Z_{obs}} \big(\Sigma^r_{Z_{obs}}\big)^{-1} z_{obs,i},$$

where $\Sigma^r_{N, Z_{obs}}$ is the submatrix of $\Sigma^r$ corresponding to the cross-covariance between
$X_N$ and $(X_{S^*}, Y)$, and $\Sigma^r_{Z_{obs}}$ is the submatrix corresponding to the covariance of
$(X_{S^*}, Y)$. Moreover, define

$$C^r_{N,i} := \begin{cases} 0 & \text{if example } i \text{ is from the test task,} \\ \mathrm{Cov}(X_N \mid z_{obs,i}, \Sigma^r) = \Sigma^r_N - \Sigma^r_{N, Z_{obs}} \big(\Sigma^r_{Z_{obs}}\big)^{-1} \Sigma^r_{Z_{obs}, N} & \text{otherwise.} \end{cases}$$

M step: compute the sample covariance given the imputed data:

$$\Sigma^{r+1} = \frac{1}{n} \sum_{i=1}^n \big( z_i^r\, z_i^{r\,t} + C_i^r \big),$$

where $C_i^r$ is a $(p+1) \times (p+1)$ matrix whose submatrix corresponding to features in $N$ is
$C^r_{N,i}$, and the remaining elements are 0. The intuition for the M step is simple: we compute the sample
covariance with the values imputed for $X_N$. Since these values are being imputed, matrix $C_i^r$ adds
uncertainty for the corresponding values.

Once the algorithm has converged, we can read off the regression coefficients from the joint covariance
matrix as the conditional mean $\mathbb{E}[Y \mid X = x]$. The whole procedure is initialized with the sample
covariance matrix computed with the available labeled sample from $T$.
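
The E and M steps above can be sketched compactly with NumPy. This is a minimal illustration, not the authors' implementation: it assumes centered data, a fixed number of iterations instead of a convergence check, and enough labeled test examples for the initial sample covariance to be invertible. The names `em_covariance` and `regression_coefficients`, and the toy data, are hypothetical.

```python
import numpy as np

def em_covariance(Z, missing_rows, n_idx, n_iter=200):
    """EM estimate of the covariance of (X, Y) when the columns n_idx (X_N)
    are missing in the rows flagged by the boolean array missing_rows."""
    Z = np.asarray(Z, dtype=float)
    missing_rows = np.asarray(missing_rows, dtype=bool)
    n, d = Z.shape
    rows = np.flatnonzero(missing_rows)
    n_idx = np.asarray(n_idx)
    o_idx = np.setdiff1d(np.arange(d), n_idx)        # columns of (X_S*, Y)
    complete = Z[np.flatnonzero(~missing_rows)]      # labeled test examples
    Sigma = complete.T @ complete / len(complete)    # initialisation
    for _ in range(n_iter):
        S_oo = Sigma[np.ix_(o_idx, o_idx)]
        S_no = Sigma[np.ix_(n_idx, o_idx)]
        gain = S_no @ np.linalg.inv(S_oo)
        # E step: impute X_N by its conditional mean given the observed part
        Z_imp = Z.copy()
        Z_imp[np.ix_(rows, n_idx)] = Z[np.ix_(rows, o_idx)] @ gain.T
        C_N = Sigma[np.ix_(n_idx, n_idx)] - gain @ S_no.T   # conditional covariance
        # M step: sample covariance of imputed data plus the imputation uncertainty
        Sigma = Z_imp.T @ Z_imp / n
        Sigma[np.ix_(n_idx, n_idx)] += len(rows) / n * C_N
    return Sigma

def regression_coefficients(Sigma, y_idx):
    """Coefficients of the conditional mean of column y_idx given all other columns."""
    x_idx = np.setdiff1d(np.arange(Sigma.shape[0]), [y_idx])
    return np.linalg.solve(Sigma[np.ix_(x_idx, x_idx)], Sigma[x_idx, y_idx])

# Hypothetical toy data with columns (X_S*, X_N, Y).
rng = np.random.default_rng(0)
n_train, n_test = 2000, 40
n = n_train + n_test
x_s = rng.normal(size=n)
y = 1.5 * x_s + rng.normal(size=n)
x_n = 0.8 * y + rng.normal(size=n)
Z = np.column_stack([x_s, x_n, y])
missing = np.arange(n) < n_train                     # training rows lack X_N
Z[missing, 1] = np.nan

Sigma_hat = em_covariance(Z, missing, n_idx=[1])
beta_hat = regression_coefficients(Sigma_hat, y_idx=2)
print(beta_hat)
```

In this sketch the block of the estimate that corresponds to $(X_{S^*}, Y)$ is determined by all $n$ examples, while the entries involving $X_N$ draw on the few complete rows together with the current model.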


Incorporating unlabeled data. The previous method also allows us to incorporate unlabeled data from
the test task. Indeed, assume that an unlabeled sample $X^T = (X^T_{S^*}, X^T_N)$ from the test task is also
available at training time. This can be incorporated in the previous framework since the label $Y$ can be
considered to be missing. We can then write
$z_i^r = (x_{S^*,i}, x_{N,i}, \mathbb{E}(Y \mid z_{obs,i}, \Sigma^r))$ for the unlabeled data, thus imputing
the value of $Y$ in the E step by the conditional mean given $(x_{S^*,i}, x_{N,i})$. The added covariance is
then $C^r_{Y,i} = \mathrm{Var}(Y)^r - \Sigma^r_{Y, Z_{obs}} (\Sigma^r_{Z_{obs}})^{-1} \Sigma^r_{Z_{obs}, Y}$.
The rest of the algorithm remains unchanged.

A naive estimator for comparison. In the population setting, a proposition in Appendix A.4 provides
an expression for $\beta^{opt}$ as a function of $\alpha$ and $\varepsilon$ from Assumption (A2). As in the
previous paragraph, one could try to estimate the covariance matrix of $(X, Y)$ using the knowledge of
$\alpha$ and $\varepsilon$ from the training tasks, and then read off the regression coefficients. In the
presence of a finite amount of labeled and unlabeled data from the test task, a naive approach would thus
plug in the knowledge of $\alpha$ and $\varepsilon$ as follows: the entries of $\Sigma_{X,Y}$ that correspond
to the covariances between $X_{S^*}$ and $Y$ are replaced with $\Sigma_{X_{S^*}} \cdot \alpha$, and the entry
corresponding to the variance of $Y$ is replaced by
$\alpha^t \Sigma_{X_{S^*}} \alpha + \mathrm{Var}(\varepsilon)$. This, however, often performs worse than
forgetting about $\alpha$ and using the data in the test domain only (see the left panel of the
corresponding figure). Why is this the case? The naive solution described above leads to a matrix $\Sigma$
that not only fails to maximize (8) but often is not even positive definite. One needs to optimize over the
free parameters of $\Sigma$, which correspond to the covariance between $X_N$ and $Y$, given the constraint
of positive definiteness. For comparison, we modified the naive approach as follows. First, we find a
positive definite matrix satisfying the desired constraints. In order to do this, we solve a semi-definite
program (SDP) with a trivial objective which always equals zero. Then, we maximize the likelihood (8) over
the free parameters of $\Sigma$ with a Nelder-Mead simplex algorithm. The constrained optimization problem
can be shown to be convex in the neighbourhood of the optimum [Zwiernik et al., 2014, Sec. 3] if the number
of data in the test domain grows. While gradients can be computed for this problem, gradient-based methods
seem to perform poorly in practice (experiments are not shown for gradient-based methods).

In an idealized scenario, an infinite amount of unlabeled data in the test task and labeled data in the
training tasks could provide us with $\Sigma_X$, $\Sigma_{(X_{S^*}, Y)}$ and $\mathrm{Var}(Y)$. We could then
plug these values into $\Sigma$ and optimize over the remaining parameters (see the corresponding curve in
the left panel of the same figure). In practice, we have to estimate $\Sigma_X$, $\Sigma_{(X_{S^*}, Y)}$ and
$\mathrm{Var}(Y)$ from data. Thus, the EM approach mentioned above constitutes the more principled approach.


2.3 Relation to causality 

In this section, we provide a brief introduction to causal notions in order to motivate our method. More
specifically, we show that under some conditions, the set $S^*$ of causal parents verifies Assumptions (A1)
and (A1'). Structural equation models (SEMs) [Pearl, 2009] are one possibility to formalize causal
statements. We say that a distribution over random variables $X = (X_1, \ldots, X_p)$ is induced by a
structural equation model with corresponding graph $\mathcal{G}$ if each variable $X_j$ can be written as a
deterministic function of its parents $PA_j^{\mathcal{G}}$ (in $\mathcal{G}$) and some noise variable $N_j$:

$$X_j = f_j\big(X_{PA_j^{\mathcal{G}}}, N_j\big), \quad j = 1, \ldots, p. \qquad (9)$$

Here, the graph is required to be acyclic and the noise variables are assumed to be jointly independent. An
SEM comes with the ability to describe interventions. Intervening in the system corresponds to replacing
one of the structural equations (9). The resulting joint distribution is called an intervention
distribution. Changing the equation for variable $X_j$ usually affects the distribution of its children, for
example, but never the distribution of its parents. Consider now an SEM over variables $(X, Y)$. Here, we do
not specify the graphical relation between $Y$ and the other nodes: $Y$ may or may not have children or
parents. Suppose further that the different tasks $P^1, \ldots, P^D$ are intervention distributions of an
underlying SEM with graph structure $\mathcal{G}$. If the target variable has not been intervened on, then
the set $S^* := PA_Y^{\mathcal{G}}$ satisfies Assumptions (A1) and (A1'). This means that as long as the
interventions do not take place at the target variable, the set $S^*$ of causal parents will satisfy
Assumptions (A1) and (A1').
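
A small simulation can make this invariance concrete. The sketch below uses a hypothetical three-variable linear Gaussian SEM $X_1 \to Y \to X_2$ in which each task intervenes on the distribution of $X_1$ and on the structural equation of $X_2$, but never on $Y$. All coefficients and function names are assumptions chosen for illustration.

```python
import numpy as np

def sample_task(rng, n, x1_scale, gamma):
    """Draw one task from the SEM X1 -> Y -> X2 under an intervention."""
    x1 = x1_scale * rng.normal(size=n)       # intervention: distribution of the parent X1
    y = 2.0 * x1 + rng.normal(size=n)        # structural equation of Y, never modified
    x2 = gamma * y + rng.normal(size=n)      # intervention: mechanism of the child X2
    return np.column_stack([x1, x2]), y

def ols(X, y):
    """Least squares coefficients without intercept (variables are centered)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
X_a, y_a = sample_task(rng, 5000, x1_scale=1.0, gamma=1.0)
X_b, y_b = sample_task(rng, 5000, x1_scale=3.0, gamma=-0.5)

# Regression on the causal parent only: the coefficient is shared across tasks.
par_a = ols(X_a[:, [0]], y_a)[0]
par_b = ols(X_b[:, [0]], y_b)[0]

# Regression on all features: the coefficients change with the intervention on X2.
full_a = ols(X_a, y_a)
full_b = ols(X_b, y_b)

print("parents only:", par_a, par_b)
print("all features:", full_a, full_b)
```

In this toy SEM the conditional of $Y$ given $X_1$ stays fixed across the two tasks, whereas the conditional of $Y$ given $(X_1, X_2)$ does not.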

Recently, Peters et al. [2015] have given several sufficient conditions for the identifiability of the
causal parents in the linear Gaussian framework. E.g., if the interventions take place at informative
locations, or if we see sufficiently many different interventions, the set of causal parents is the only
set $S^*$ that satisfies Assumptions (A1) and (A1'). If there exists more than one set leading to invariant
predictions, they consider the intersection of all such subsets. In this sense, seeing more environments
helps for identifying the causal structure. In this work, we are interested in prediction rather than
causal discovery. Therefore, we try to find a trade-off between models that predict well and invariant
models that generalize well to other domains. That is, we are interested in the subset which leads to
invariant conditionals and minimizes the prediction error across training tasks.

If the tasks $P^k$ correspond to interventions in an SEM, we may construct an extended SEM with a
parent-less environment variable $E$ that points into the intervened variables. Then, $P^k$ equals the
distribution of $(X, Y) \mid E = k$, see [Peters et al., 2015, Appendix C]. If the distribution of
$(X, Y, E)$ is Markov and faithful w.r.t. the extended graph, the smallest set $S$ that leads to invariant
conditionals and to best prediction is a subset of the Markov blanket of $Y$: certainly, it contains all
parents of $Y$; if it includes a descendant of $Y$, this must be a child of $Y$ (which yields better
prediction and still blocks any path from $Y$ to $E$); analogously, any contained ancestor of a child of
$Y$ must be a parent of that child.


3 Learning invariant conditionals 

In the previous section, we have seen how a known invariant subset $S^* \subseteq \{1, \ldots, p\}$ of predictors leading to 
invariant conditionals $Y^k \mid \mathbf{X}^k_{S^*}$, see Assumption (A1), can be beneficial in the problems of DG and MTL. 
In practice, such a set $S^*$ is often unknown. We now present a method that aims at inferring such a subset 
$S^*$ from data. Throughout this paper, we denote by $S$ any subset of features, while $S^*$ is an invariant set 
(which is not necessarily unique). The method we propose provides an estimator $\hat{S}$ of $S^*$. It is summarized 
in Algorithm 1; code is provided in https://bitbucket.org/mrojascarulla/subsets_sub. 

3.1 Our method. 


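Algorithm 1: Subset search 

Inputs: Sample $(\mathbf{x}_i^k, y_i^k)_{i=1}^{n_k}$ for tasks $k \in \{1, \ldots, D\}$, threshold $\delta$ for independence test. 

Outputs: Estimated invariant subset $\hat{S}$. 

1 Set $S_{acc} = \{\}$, $MSE = \{\}$. 
2 for $S \subseteq \{1, \ldots, p\}$ do 
3     linearly regress $Y$ on $\mathbf{X}_S$ and compute the residuals $R_{\beta^{CS(S)}}$ on a validation set. 
4     compute $H = \mathrm{HSIC}_b\big((R_{\beta^{CS(S)}, i}, K_i)_{i=1}^{n}\big)$ and the corresponding p-value $p^*$ (or the p-value from an alternative test, e.g. Levene test). 
5     compute $\hat{\mathcal{E}}_{P^{1,\ldots,D}}(\beta^{CS(S)})$, the empirical estimate of $\mathcal{E}_{P^{1,\ldots,D}}(\beta^{CS(S)})$ on a validation set. 
6     if $p^* > \delta$ then 
7         $S_{acc}$.add($S$), $MSE$.add($\hat{\mathcal{E}}_{P^{1,\ldots,D}}(\beta^{CS(S)})$) 
8     end 
9 end 
10 Select $\hat{S}$ according to RULE, see Section 3.4 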


Algorithm 2: Greedy subset search 

Inputs: Sample $(\mathbf{x}_i^k, y_i^k)_{i=1}^{n_k}$ for tasks $k \in \{1, \ldots, D\}$, threshold $\delta$ for independence test. 

Outputs: Estimated invariant set $\hat{S}_{greedy}$. 

1 Set $S_{acc} = \{\}$, $S_{current} = \{\}$, $MSE = \{\}$. 
2 for $i \in \{1, \ldots, n_{iters}\}$ do 
3     Set $stat_{min} = \infty$. 
4     for $S \in \mathcal{S}_{S_{current}}$ do 
5         linearly regress $Y$ on $\mathbf{X}_S$ and compute the residuals $R_{\beta^{CS(S)}}$ on a validation set. 
6         compute $H = \mathrm{HSIC}_b\big((R_{\beta^{CS(S)}, i}, K_i)_{i=1}^{n}\big)$ and the corresponding p-value $p^*$ (or the p-value from an alternative test, e.g. Levene test). 
7         compute $\hat{\mathcal{E}}_{P^{1,\ldots,D}}(\beta^{CS(S)})$, the empirical estimate of $\mathcal{E}_{P^{1,\ldots,D}}(\beta^{CS(S)})$ on a validation set. 
8         if $p^* > \delta$ then 
9             $S_{acc}$.add($S$), $MSE$.add($\hat{\mathcal{E}}_{P^{1,\ldots,D}}(\beta^{CS(S)})$), set $S_{current} = S$. 
10        end 
11        else if $H < stat_{min}$ then 
12            set $S_{current} = S$, $stat_{min} = H$. 
13        end 
14    end 
15 end 
16 Select $\hat{S}$ according to RULE, see Section 3.4 


Consider a set of $D$ tasks, a target variable $Y^k$ and a vector $\mathbf{X}^k$ of $p$ predictor variables in task $k$. For 
$\beta \in \mathbb{R}^p$, we define the residual in task $k$ as: 

$$R^k_\beta = Y^k - \beta^t \mathbf{X}^k, \quad k \in \{1, \ldots, D\}. \qquad (10)$$
By Assumptions (A1) and (A2), there exists a subset $S^*$ and some vector $\beta^{CS(S^*)}$ such that $\beta_j^{CS(S^*)} = 0$ for all $j \notin S^*$ and 

$$R^1_{\beta^{CS(S^*)}} \overset{d}{=} \cdots \overset{d}{=} R^D_{\beta^{CS(S^*)}}.$$

Such a set $S^*$ is not necessarily unique. As stated in Peters et al. [2015], the number of invariant subsets decreases as more different tasks are observed at training time. 


We propose to do an exhaustive search over subsets S of predictors and statistically test for equality of the 
distribution of the residuals in the training tasks, see the section below. Among the accepted subsets, we 
select the subset S which leads to the smallest error on the training data. This is a fundamental difference 
to the method proposed by Peters et al. [2015]. Indeed, while our method addresses the transfer problem, 
Peters et al. [2015] is about causal discovery. Algorithm 1 finds an invariant subset which also leads to the 
lowest validation error. This subset may contain covariates which are non-causal. On the other hand, Peters 
et al. [2015] estimate the causal parents (with coverage guarantee). Such an approach has a different purpose 
and performs poorly both in DG and MTL: e.g., when all tasks are identical, it uses the empty set as 
predictors. 

In Section 3.3, we propose two solutions for when the number of predictors p is too large for an exhaustive 
search: a greedy method and variable selection. While the algorithms are presented using linear regression, 
the extension to a nonlinear framework is straightforward. In particular, linear regression can be replaced 
by a nonlinear regression method. 
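The control flow of the exhaustive search can be sketched in a few lines of Python. This is a minimal illustration and not the code released with the paper: the callbacks `pvalue_fn` and `mse_fn` are hypothetical stand-ins for the invariance test on the residuals and for the validation error of the regression on a given subset, and features are indexed from 0 for convenience.

```python
from itertools import combinations


def subset_search(p, pvalue_fn, mse_fn, delta=0.05):
    """Collect all subsets of {0, ..., p-1} accepted as invariant.

    pvalue_fn(S) -> p-value of the invariance test for subset S (hypothetical).
    mse_fn(S)    -> validation error of the regression on S (hypothetical).
    Returns the accepted subsets and their errors as two parallel lists.
    """
    s_acc, mse = [], []
    for size in range(p + 1):
        for S in combinations(range(p), size):
            # The null hypothesis of invariant residuals is not rejected.
            if pvalue_fn(S) > delta:
                s_acc.append(S)
                mse.append(mse_fn(S))
    return s_acc, mse


def toy_pvalue(S):
    # Toy assumption: any subset containing feature 2 is not invariant.
    return 0.01 if 2 in S else 0.5


def toy_mse(S):
    # Toy assumption: the error decreases as more features are used.
    return 1.0 / (1 + len(S))


accepted, errors = subset_search(3, toy_pvalue, toy_mse)
```

The loop visits all $2^p$ subsets, which is why the search is only practical for small $p$.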


3.2 Statistical tests for equality of distributions. 


In order to test whether a subset $S$ leads to invariant conditionals, we can use a statistical test to check 
whether the residuals $R_{\beta^{CS(S)}}$ have the same distribution in all tasks $k \in \{1, \ldots, D\}$. We propose two 
possible methods. 

For Gaussian data, one can use a Levene test [Levene, 1960] to test whether the residuals have the same 
variance in all tasks; their means are zero as long as an intercept is included in the regression model. 

As an alternative, we propose a nonparametric $D$-sample test by testing whether the residuals are 
independent of the task index. This test is a direct application of HSIC [Gretton et al., 2007] but, to our 
knowledge, is novel. Suppose that the index of the task can be considered as a random variable $K$. We 
consider the sample $Z = \{(R_{\beta^S, i}, K_i)\}_{i=1}^{n}$ as drawn from a joint distribution over residuals and task indices, 
where $n = \sum_{k=1}^{D} n_k$ and $K_i \in \{1, \ldots, D\}$ is a discrete value indicating the index of the corresponding task. 
The residuals have the same distribution in all training tasks if and only if $R_{\beta^S}$ and $K$ are independent. Two 
characteristic kernels are used: a kernel $\kappa$ is used for embedding the residuals and a trivial kernel $d$ such 
that $d(i, j) = \delta_{ij}$ is used for $K$. Let therefore $\mathrm{HSIC}(R_{\beta^S}, K)$ denote the value of the HSIC [Gretton et al., 
2007] between $R_{\beta^S}$ and $K$, and let $\mathrm{HSIC}_b(Z)$ be the corresponding test statistic. A subset $S$ is accepted 
if it leads to accepting the null hypothesis of independence between $R_{\beta^S}$ and $K$ at level $\delta$. 

Both in the case of the Levene test and the $D$-sample test, the test outputs a p-value $p^*$, and we accept 
the null $H_0$ if $p^* > \delta$. Among these accepted subsets, we output the set $\hat{S}$ which leads to the smallest loss 
on a validation set. The test level $\delta$ is given as an input to our method and allows for a trade-off between 
predictive accuracy and exploiting invariance. As $\delta$ tends to zero, the null is accepted for all subsets and we 
then select all features, which is equivalent to covariate shift. When $\delta$ approaches one, no subset is accepted 
as invariant. Our method then reduces to the mean prediction. In order to compute p-values, a Gamma 
approximation is used for the distribution of $\mathrm{HSIC}_b(Z)$ under the null. 
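As an illustration of the $D$-sample statistic, the following sketch computes the biased HSIC estimate $\frac{1}{n^2}\mathrm{tr}(KHLH)$ between pooled residuals and their task indices, using a delta kernel on the task index. The Gaussian kernel on the residuals and its bandwidth are assumptions made for this example. The sketch returns only the statistic; it does not implement the Gamma approximation used to obtain p-values.

```python
import math


def hsic_b(residuals, tasks, bandwidth=1.0):
    """Biased HSIC estimate (1/n^2) * trace(K H L H).

    residuals: pooled residuals R_i from all tasks.
    tasks:     task index K_i of each residual.
    bandwidth: Gaussian kernel width (an assumption of this sketch).
    """
    n = len(residuals)
    # Gaussian kernel on the residuals.
    K = [[math.exp(-(ri - rj) ** 2 / (2 * bandwidth ** 2)) for rj in residuals]
         for ri in residuals]
    # Trivial (delta) kernel on the task index.
    L = [[1.0 if ti == tj else 0.0 for tj in tasks] for ti in tasks]
    # Double-centre K, i.e. form H K H with H = I - (1/n) 1 1^t.
    row_mean = [sum(row) / n for row in K]
    total_mean = sum(row_mean) / n
    Kc = [[K[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
          for i in range(n)]
    # trace(K H L H) = sum_ij (H K H)_ij * L_ij since both matrices are symmetric.
    return sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n)) / n ** 2


res = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0]
dependent = hsic_b(res, [1, 1, 1, 2, 2, 2])   # residuals differ by task
mixed = hsic_b(res, [1, 2, 1, 2, 1, 2])       # residuals spread across tasks
```

In this toy example the statistic is markedly larger when the residual distribution changes with the task, which is the situation in which a subset should be rejected.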


3.3 Scalability to a large number of predictors 

When the number of features p is large, full subset search is computationally not feasible. We propose two 
solutions for this scenario. If one has reasons to believe that the signal is sparse, that is, the true set $S^*$ 
is small, one may use a variable selection technique such as the Lasso [Tibshirani, 1996] as a first step. 
Under the assumptions described in Section 2.3, we know that the invariant set with the best prediction in 
the training tasks can be assumed to be a subset of the relevant features (which here equals the Markov 
blanket of Y). Thus, if variable screening is satisfied, i.e., one selects all relevant variables and possibly more, 
the pre-selection step does not change the result of Algorithm 1 in the limit of infinitely many data. For 
Figure 3: Example of a directed acyclic graph, see Section 2.3. If Y is not intervened on, the conditional 
$Y \mid X_1, X_2, X_3$ remains invariant. 


| estimator | description |
|---|---|
| $\beta^{CS(cau)}$ | Linear regr. with true causal predictors (often unknown in practice). |
| $\beta^{CS(\hat{S})}$ | Finding the invariant set $\hat{S}$ using full subset search and performing lin. regr. using predictors in $\hat{S}$. $\hat{S}_{greedy}$ corresponds to finding the invariant set using a greedy procedure. $\hat{S}_{Lasso}$ corresponds to doing variable selection using Lasso as a first step, then doing full subset search on the selected features. |
| $\beta^{CS}$ | Pooling the training data and using linear regr. |
| $\beta^{\hat{S}\sharp}$ | Finding the invariant set $\hat{S}$ using full subset search and maximizing the MTL objective using EM. |
| $\beta^{mean}$ | Pooling the training data and outputting the mean of the target. |
| $\beta^{dom}$ | Linear regression using only the available labeled sample from T. |
| $\beta^{MTL}$ | Multi-task feature learning estimator [Argyriou et al., 2007a]. |
| $\beta^{DICA}$ | DICA [Muandet et al., 2013] with rbf kernel. |
| $\beta^{mSDA}$ | Pooling the training data and an unlabeled sample from T, learning features using mSDA [Chen et al., 2012] with one layer and linear output, then using linear regr. |

Table 2: Estimators used in the numerical experiments. A '$\sharp$' next to a subset $S$ corresponds to the method 
for MTL described in the last paragraph of Section 2.2. 


linear models with $\ell_1$ penalization, variable screening is a well studied problem, see e.g. compatibility and 
$\beta_{min}$ conditions [Bühlmann and van de Geer, 2011, Chapter 2.5]. 

Alternatively, one may perform a greedy search over subsets when full subset search is not feasible. 
Denote by $\mathcal{S}_S$ the collection of neighbouring sets of a set $S$ obtained by adding or removing exactly one 
predictor in $S$. If no subset has been accepted at a given iteration, we select the neighbor leading to the 
smallest test statistic. If a neighbor is accepted, we select the one which leads to the smallest training error. 
We start with the p subsets with only one element, and allow adding or removing a single predictor at each 
step, see Algorithm 2. As often for greedy methods, there is no theoretical guarantee. 
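The neighbourhood and a single greedy move, as described in the paragraph above, can be sketched as follows. The callbacks `test_fn` and `mse_fn` are hypothetical placeholders for the invariance test and the training error, and features are indexed from 0.

```python
def neighbours(S, p):
    """All sets obtained from S by adding or removing exactly one predictor."""
    S = frozenset(S)
    # The symmetric difference with {j} removes j if present and adds it otherwise.
    return [S ^ {j} for j in range(p)]


def greedy_step(current, p, test_fn, mse_fn, delta=0.05):
    """One greedy move from the set `current`.

    test_fn(S) -> (statistic, p_value) of the invariance test (hypothetical).
    mse_fn(S)  -> training error of the regression on S (hypothetical).
    Returns the next set and a flag telling whether it was accepted as invariant.
    """
    results = [(S, *test_fn(S)) for S in neighbours(current, p)]
    accepted = [S for S, _, p_value in results if p_value > delta]
    if accepted:
        # Among accepted neighbours, take the one with the smallest error.
        return min(accepted, key=mse_fn), True
    # Otherwise move to the neighbour with the smallest test statistic.
    return min(results, key=lambda r: r[1])[0], False
```

Each set has exactly $p$ neighbours, so one greedy move costs $p$ regressions and tests instead of the $2^p$ needed by the full search.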


3.4 Subset selection in MTL 

In DG, among the accepted subsets, we select the set $\hat{S}$ which leads to the lowest validation error. In MTL, 
however, a labeled sample from the test task T is available at training time. Therefore, Algorithm 1 is 
slightly modified. First, we get all the sets for which $H_0$ is accepted. Then, we select the accepted set $\hat{S}$ 
which leads to the smallest 10-fold cross validation error. For each subset, we compute the least squares 
coefficients using the procedure described in Section 2.2 and measure the prediction error on the held out 
validation set. Using the notation of Algorithm 1, let $S_{acc}$ be the set of subsets accepted as invariant, and 
let $MSE$ be the set of their corresponding squared errors on the validation set. The following rules are used 
for selecting an invariant set in DG and MTL. 

i) RULE for DG: Return $\hat{S} = S_{acc}[\arg\min MSE]$. 

ii) RULE for MTL: Define $CV_{acc} = \{\}$. For each set $S \in S_{acc}$, do $CV_{acc}$.add($CV_S$), where $CV_S$ is 
the 10-fold cross validation error over the labeled test data obtained by optimizing the MTL objective using EM with 
subset $S$. 

Return $\hat{S} = S_{acc}[\arg\min CV_{acc}]$. 
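Both rules amount to an arg min over the accepted subsets and can be sketched as below. The callback `cv_error_fn` is a hypothetical placeholder for the 10-fold cross validation error on the labeled test-task sample; how that error is computed is not part of this sketch.

```python
def select_dg(s_acc, mse):
    """RULE for DG: accepted subset with the smallest validation error."""
    if not s_acc:
        return None  # no subset was accepted as invariant
    best = min(range(len(mse)), key=mse.__getitem__)
    return s_acc[best]


def select_mtl(s_acc, cv_error_fn):
    """RULE for MTL: accepted subset with the smallest cross validation error.

    cv_error_fn(S) is a hypothetical callback returning the 10-fold CV error
    obtained on the labeled sample from the test task with subset S.
    """
    if not s_acc:
        return None
    cv_acc = [cv_error_fn(S) for S in s_acc]
    best = min(range(len(cv_acc)), key=cv_acc.__getitem__)
    return s_acc[best]
```

Returning `None` when nothing is accepted is a choice of this sketch; in the text, that case corresponds to falling back to the mean prediction.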


4 Experiments 

We compare our estimator to different methods, which are summarized in Table 2. $\beta^{CS(cau)}$ uses the ground 
truth for $S^*$ when it is available, $\beta^{CS(\hat{S})}$ corresponds to full search using Algorithm 1, $\beta^{CS}$ uses the pooled 
training data, $\beta^{MTL}$ performs the multi-task feature learning algorithm [Argyriou et al., 2007a] for the MTL 
setting and $\beta^{DICA}$ performs DICA [Muandet et al., 2013] for DG. For DICA, which is a nonlinear method, 
the kernel matrices are constructed using an rbf kernel, and the length-scale of the kernel is selected according 
to the median heuristic. In the MTL setting, we combine the invariance with task specific information by 
optimizing the MTL objective using EM, resulting in regression coefficients $\beta^{\hat{S}\sharp}$, and $\beta^{cau\sharp}$ when the ground truth is known. 
Finally, $\beta^{cau\sharp,UL}$ indicates that unlabeled data from T was also available. For reference, Figure 5 (left) 
provides results for $\beta^{cau+}$ and its infinite-data variant, which correspond to the estimators described in the last paragraph 'A 
naive estimator for comparison' of Section 2.2. $\beta^{cau+}$ uses the ground truth for $S^*$ and $\alpha$. The infinite-data 
variant also assumes that we know the ground truth for the entries of the covariance matrix for the test task 
corresponding to the covariance of $\mathbf{X}$, the covariance between $\mathbf{X}_{S^*}$ and $Y$, and the variance of $Y$. 


4.1 Synthetic data set 





[Figure 4: three panels (left, middle, right); x-axis: number D of training tasks; y-axis: log MSE.] 


Figure 4: DG setting. Logarithm of the empirical squared error in the test task for the different estimators 
in the DG setting. The results show averages and standard deviations over 100 repetitions. We vary the 
number of tasks D available at training time. Left: both S and N are of size 4, such that X is 8-dimensional. 
Middle: 32 noise variables are added to X. Variable selection using the Lasso is used prior to computing 
$\beta^{CS(\hat{S}_{Lasso})}$, while $\beta^{CS(\hat{S}_{greedy})}$ uses all predictors. Right: both S and N are of size 20. Full search is not 
computationally feasible in this setting. 

In this section, we generate a synthetic data set in which the causal structure of the problem is known. 
The number of predictors p varies in each experiment. As for all experiments, we choose $\delta = 0.05$ as a 
rejection level for the statistical test in Algorithms 1 and 2. For each task $k \in \{1, \ldots, D, T\}$, we sample 
a set of causal variables from a multivariate Gaussian $\mathbf{X}^k_{S^*} \sim \mathcal{N}(0, \Sigma^k_{S^*})$, where the covariance matrix 
$\Sigma^k_{S^*}$ is computed as $U_{S^*}(U_{S^*})^t$, where $U_{S^*}$ is a $(|S^*|, |S^*|)$ matrix of uniformly distributed random variables with 
values between $-2$ and $2$. The target variable $Y^k$ is drawn as $Y^k = \alpha^t \mathbf{X}^k_{S^*} + \epsilon^k$, where $\epsilon^k \sim \mathcal{N}(0, 2)$ (the 
standard deviation of $\epsilon^k$ is 6 for the non sparse DG experiment with 40 predictors, see the right of Figure 4). 
Finally, we sample the remaining predictor variables as X^ = -I- where ~ A/’(0, E^). 

and E^ are sampled similarly to E|,, with uniforms between —4 and 4 and —1 and 1 respectively, a is 
sampled from a uniform distribution 77(—1,2.5), while 7 ^ is sampled from a Student t-distribution for the 
DG problem, and a uniform between 0 and 1.5 for MTL. The vector a is the same in all tasks. 

Our goal is to linearly predict the target $Y^T$ using predictors $\mathbf{X}^T = (\mathbf{X}^T_{S^*}, \mathbf{X}^T_N)$ on the test task. Given a 
regression coefficient $\beta$, we measure the performance in the test task using the logarithm of the empirical 
estimator of $\mathcal{E}_{P^T}(\beta)$. 
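The sampling of the causal variables and of the target described in this section can be sketched as follows. This is an illustration only: it covers the causal part of the data (not the remaining predictors), it reads $\mathcal{N}(0, 2)$ as a noise variance of 2, and the function name `sample_task` and the fixed seed are choices made for this example.

```python
import random


def sample_task(n, alpha, rng, noise_var=2.0):
    """Draw n examples with X ~ N(0, U U^t) and Y = alpha^t X + eps."""
    d = len(alpha)
    # U has i.i.d. uniform entries in [-2, 2]; the covariance of X is U U^t.
    U = [[rng.uniform(-2.0, 2.0) for _ in range(d)] for _ in range(d)]
    xs, ys = [], []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # x = U z has covariance U U^t when z is standard normal.
        x = [sum(U[i][j] * z[j] for j in range(d)) for i in range(d)]
        # Assumption: N(0, 2) denotes a variance of 2.
        eps = rng.gauss(0.0, noise_var ** 0.5)
        xs.append(x)
        ys.append(sum(a * xi for a, xi in zip(alpha, x)) + eps)
    return xs, ys


rng = random.Random(0)
# alpha is drawn once and shared by all tasks.
alpha = [rng.uniform(-1.0, 2.5) for _ in range(4)]
# A fresh covariance matrix is drawn for every task.
tasks = [sample_task(4000, alpha, rng) for _ in range(3)]
```

Because $\alpha$ and the noise distribution are shared while the covariance of the causal variables changes from task to task, the conditional of the target given the causal variables is the same in every task.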

In Figure 4, we are in the DG setting (thus, no labeled examples from T are observed at training). 4000 
examples per training task are available. We report the average empirical MSE over left out test tasks. We 
Figure 5: MTL setting. Percentage of repetitions (out of 100) for which the corresponding method outperforms 
$\beta^{dom}$. Both S and N are of size 3, such that X is 6-dimensional. Left: AMTL setting. This plot 
shows that the methods presented in Section 2.2 do not perform very well, even in the presence of a 
large amount of data: 50000 unlabeled examples from T and 36000 training examples are available. Middle: 
still in the AMTL setting, we fix the number of training data (300 per task, 50 in the test task) and vary 
the amount of unlabeled data available. We report the percentage of scenarios in which the corresponding 
method outperforms this time (no unlabeled data). While always performs worse than 

and does not exploit the unlabeled data, we see that $\beta^{cau\sharp,UL}$ performs better as the amount of unlabeled 
data increases. Right: SMTL setting, and we vary the number of labeled examples available in each training 
task. In this setting, the methods using unlabeled data were given 100 unlabeled examples. 


study both sparse and non sparse settings (in which full search is not feasible). On the left and middle, we 
see that when more than five training tasks are available, both the full search and greedy approaches are able 
to recover an invariant set, and outperform pooling the data for any number of training tasks. When more 
than five training tasks are observed, $\beta^{CS(\hat{S})}$ performs like $\beta^{CS(cau)}$, which uses knowledge of the ground 
truth. On the right, full search is not feasible, and $\beta^{CS(\hat{S}_{greedy})}$ outperforms other approaches. 

Figure 5 (left) considers an AMTL setting, in which large amounts of labeled data (36000) from the 
training tasks and unlabeled data from the test task (50000) are available. Both S and N are of size 3, such 
that X is 6-dimensional. We report the percentage of simulations for which the population MSE of a given 
approach outperforms pd-°'^_ We see that systematically outperforms Moreover, and 

P^'^ also perform well, and positive transfer is effective. However, a prohibitively large amount of labeled 
and unlabeled data is needed for these approaches. In a setting with only 300 examples per training task 
and 50 examples from T, we plot in Figure]^ the histogram of the error difference A = £(/3'^°"‘) — £{P) for 
and Figure i corresponds to the same setting, but we vary the number of unlabeled 
data available ^e only plot methods that use unlabeled data, and is used as reference instead of 

pdomy Figure^ (right) considers an SMTL setting in which no unlabeled data is available, and only few 
labeled examples are available in each task. Here, we see that p^'^ and perform well, while 

other methods do not. 


4.2 Gene perturbation experiment 


We apply our method to gene perturbation data provided by Kemmeren et al. [2014]. This data set consists 
of the mRNA expression levels of $p = 6170$ genes $X_1, \ldots, X_p$ of Saccharomyces cerevisiae (yeast). 
It contains both $n_{obs} = 160$ observational data points and $n_{int} = 1479$ data points from intervention 
experiments. In each of these interventions, one known gene (out of $p$ genes) is deleted. In the following, 
we consider two different tasks. The observational sample is drawn from the first task, and the pooled $n_{int}$ 
interventions are drawn from the second task. 
Figure 6: In the AMTL setting, 300 examples from each of the training tasks and 50 from the test task are 
available. For ^caut),(7 L^ ^qq ^nlabeled examples from T are also available. We run 1000 repetitions and plot 
the histograms of $\Delta = \mathcal{E}(\beta^{dom}) - \mathcal{E}(\beta)$. The dashed line indicates the number of examples for which 
outperforms our methods, which correspond to the following fraction of the repetitions: 0.695 for 
0.798 for ^cauD,(7L 0.631 for Thus, the proposed estimators outperform We also note that 

errors in computing an invariant set occasionally lead to small values of A for /3'®^ (large errors). 




Figure 7: Example of the expression of pairs of genes, where A is causal (left) and B is non-causal (right) 
of target Y. The blue points are from the observational sample (task 1), the red dots are the interventional 
sample (task 2), and the green point corresponds to the single interventions in which A and B are intervened 
on respectively. On the left, a model learnt on the data in red and blue would still perform well on the 
intervention point, which is not the case on the right. 


Motivation In order to gain an intuition about the experiments we are presenting, consider Figure 7. We 
select as a target a gene Y out of the p genes, and our goal is to predict the activity of Y given the remaining 
$p - 1$ genes as features. Some of these $p - 1$ genes are causal of the activation of Y. For example, Figure 7 
shows on the x-axis the activity of two genes (gene A on the left, gene B on the right) such that: 

• The expressions of A and B are strongly correlated with the expression of Y. 

• A is causal of Y (here, we use the definition of a causal effect proposed by Peters et al. [2015]). 

• B is non-causal of Y (anticausal or confounded). 

In Figure 7 (left), the blue points correspond to the 160 data points from the observational sample, which 
corresponds to the first task. The red dots are the 1478 data points from the interventional sample, except 
for the single data point for which A is intervened on, and constitute the second task. The plot in Figure 7 
(right) is constructed analogously for B. We can indeed see that in the pooled sample from tasks 1 and 2, A 
and B are both strongly correlated with target Y. 

The key difference between the two plots is the green points. In Figure 7 (left), the green dot corresponds 
to the single intervention experiment in which gene A is intervened on. Similarly, the green dot in Figure 7 
(right) is the single point in which B is intervened on. Our goal is to consider the DG setting in which the 
test task consists of this single intervention point. 
For the causal gene A, one expects that a change in the activity of A should translate into a proportional 
change in the activity of Y. We observe that, in the particular example of the left plot, a linear regression 
model from A to Y trained only on the pooled data from tasks 1 and 2 (blue and red in Figure 7) would lead 
to a small prediction error on the intervened point (in green). That is, $S^* = \{A\}$ might be a good candidate 
for a set satisfying Assumptions (A1), (A1') and (A2). For the non-causal gene B, however, intervening on 
B leaves the activity of Y unchanged, and the linear model learnt on the data from tasks 1 and 2 performs 
badly on the test point in green. In such a case, a candidate set is the empty set $S^* = \{\}$, leading to prediction 
using the mean of the target in the training data. A model that is to be tested on these challenging 
intervention points should therefore include causal genes as features, but exclude non-causal genes. In these 
experiments, we aim at testing whether we can exclude non-causal genes such as B automatically. 


Setup We address the problem of predicting the activity of a given gene from the remaining genes. We 
are looking at the following: 

• We consider p different problems. In each problem $j \in \{1, \ldots, p\}$, we aim at predicting the activity 
$Y = X_j$ of gene $j$ using $(X_i)_{i \neq j}$ as features. 

• In each problem $j \in \{1, \ldots, p\}$, two training tasks $k \in \{1, 2\}$ are available. The data from the first 
task is the observational sample, and the data from the second task are all the $n_{int}$ interventions (we 
shall subsequently remove some points for testing, see below). 

The goal is now to apply our method to each of the problems and estimate an invariant subset. Due to 
the large number of predictors, we first select the 10 top predictor variables using the Lasso and then apply 
Algorithm 1 to select a set of invariant predictors $\hat{S}$, see $\beta^{CS(\hat{S}_{Lasso})}$ in Table 2. We denote the indices of the 
features selected using Lasso by $L = (L_1, \ldots, L_{10})$. 

The procedure is then evaluated as follows: for each problem $j \in \{1, \ldots, p\}$, we first find the genes 
in $(X_{L_1}, \ldots, X_{L_{10}})$ for which an interventional example is available. Note that this might not hold for all 
selected genes, since only $n_{int} < p$ interventions are available. We then iterate the following procedure (this 
is within the context of the same problem): for each gene in $(X_{L_1}, \ldots, X_{L_{10}})$ for which an intervention is 
available, 

• we put aside the example corresponding to this intervention from the training data (in the motivation 
example, this would correspond to the green point). 

• we estimate an invariant subset $\hat{S}_{Lasso}$ using Algorithm 1 with the remaining observational and 
interventional data. 

• we test all methods on the single intervention point which was put aside. 
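The evaluation loop for one problem can be sketched as follows. The data layout (a dictionary mapping each intervened gene to its single example) and the callbacks `fit` and `predict` are hypothetical choices for this illustration; any of the compared estimators could be plugged in.

```python
def leave_one_intervention_out(obs, interventions, candidate_genes, fit, predict):
    """Squared error on each held-out intervention point.

    obs:             list of (x, y) observational examples (task 1).
    interventions:   dict gene -> (x, y) interventional example (task 2).
    candidate_genes: genes selected by the pre-selection step.
    fit(task1, task2) -> model and predict(model, x) -> float are hypothetical.
    """
    errors = {}
    for gene in candidate_genes:
        if gene not in interventions:
            continue  # no intervention is available for this gene
        x_test, y_test = interventions[gene]
        # Put the held-out intervention aside and train on the remaining data.
        task2 = [ex for g, ex in interventions.items() if g != gene]
        model = fit(obs, task2)
        errors[gene] = (y_test - predict(model, x_test)) ** 2
    return errors


def fit_mean(task1, task2):
    # Toy model: the mean of the target over both training tasks.
    ys = [y for _, y in task1 + task2]
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model
```

With `fit_mean` and `predict_mean`, the loop reproduces the behaviour of the mean predictor, which ignores all features.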

We expect two different scenarios, as explained in the motivation paragraph above: (1) if the intervened 
gene is a cause of the target gene, it should still be a good predictor (see Section 2.3); then, it should be 
beneficial to have this gene included in the set of predictors S. (2) if the intervened gene is anticausal or 
confounded (we refer to this scenario as non-causal), the statistical relation to the target gene might change 
dramatically after the intervention and therefore, one may not want to base the prediction on this gene. In 
order to see this effect and understand how the different approaches for DG in Table 2 handle the problem, 
we consider two groups of experiments. 

(1) we select the target genes Y for which one of the features in L is causal for the activity of Y and for 
which an intervention experiment is available. 39 problems fall in this causal scenario. 

(2) out of the remaining problems we chose target genes with (non-causal) predictors that have been 
intervened on and — in order to increase the difficulty of the problem — that are strongly correlated 
with the target gene. We therefore select 269 cases for which a Pearson correlation test (the null 
hypothesis corresponds to no correlation) outputs a p-value equal to zero. 


Results Figure 8 shows box plots for the errors of the different methods for the causal problems (1) on 
the left and for the non-causal problems (2) in the middle. We do not plot outliers in order to improve 
presentation. Figure 8 (left) presents the causal scenario. As expected, pooling does well in this setting. 
Figure 8 (right) shows that in the non-causal problems (2), prediction using an invariant subset leads to less 
severe mistakes on test genes compared to pooling the tasks. 
Figure 8: In the causal problems (left), interventions are performed on causal genes. As expected, the 
input genes continue to be good predictors, and $\beta^{CS}$ works well. In the non-causal problems (middle), one 
of the inputs is intervened upon and becomes a poor predictor, impairing the performance of $\beta^{CS}$. The 
mean predictor uses none of the predictors, and therefore works comparatively well in this scenario. 

Our proposed estimator provides reasonable estimates in both the causal and non-causal settings, 

while other methods only perform well in one of the scenarios, performs similarly to in both 

scenarios, and is therefore outperformed by other methods in the causal problems (note that uses all 

available features). Right: in the non-causal scenario (2), we plot the number of test genes for which the 
squared error for $\beta^{CS}$ is larger than $\tau$ times the squared error for $\beta^{CS(\hat{S}_{Lasso})}$ and vice-versa, where $\tau$ is plotted 
on the x-axis. This plot shows the number of genes for which one of the methods does significantly worse 
than the other. By this measure, $\beta^{CS(\hat{S}_{Lasso})}$ outperforms $\beta^{CS}$ for all values of $\tau$. 


For comparison, since we know which predictors are being intervened on at test time, we included a 
method that makes use of causal knowledge: $\beta^{CS(cau)}$ uses all predictors in the causal problems (1) and 
all but the intervened gene in the non-causal problems (2). In practice, this causal knowledge is often not 
available. We regard it as promising that the fully automated procedure $\beta^{CS(\hat{S}_{Lasso})}$ performs comparably 
to $\beta^{CS(cau)}$. 


5 Conclusions and further directions 

We propose a method for transfer learning that is motivated by causal modeling and exploits a set of 
invariant predictors. If the underlying causal structure is known and the tasks correspond to interventions 
on variables other than the target variable, the causal parents of the target variable constitute such a set 
of invariant predictors. We prove that predicting using an invariant set is optimal in an adversarial setting 
in DG. If the invariant structure is not known, we propose an algorithm that automatically detects an 
invariant subset, while also focusing on good prediction. In practice, we see that our algorithm successfully 
finds a set of predictors leading to invariant conditionals when enough training tasks are available. Our 
method can incorporate additional data from the test task (MTL) and yields good performance on synthetic 
data. Although an invariant set may not always exist, our experiment on real data indicates that exploiting 
invariance leads to methods which are robust against transfer. 

Extending our framework to nonlinearities seems straightforward and may prove to be useful in many 
applications. For instance, we provide a general, nonlinear version of Theorem 1 in Appendix A.1. Moreover, 
Algorithms 1 and 2 are presented in a linear setting. However, the extension to a nonlinear framework is 
straightforward. In particular, the linear regression can be replaced by a nonlinear regression method. We 
expect that there may be feature maps leading to invariant conditionals that are different from a subset. We 
believe, finally, that the link to causal assumptions and the exploitation of causal structure may lend itself 
well to proving additional theoretical results on transfer learning. 
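A Proofs 


A.1 A nonlinear extension of Theorem 1 

The extension of Theorem 1 to a nonlinear setting is straightforward. Given a subset $S^*$ leading to invariant 
predictions, the proposed predictor is defined as the conditional expectation 

$$f_{S^*}: \mathbb{R}^p \to \mathbb{R}, \quad \mathbf{x} \mapsto \mathbb{E}\left[Y^1 \mid \mathbf{X}^1_{S^*} = \mathbf{x}_{S^*}\right]. \qquad (11)$$

The following theorem states that $f_{S^*}$ is optimal over the set of continuous functions in an adversarial setting. 


Theorem 2 Consider D tasks $(\mathbf{X}^1, Y^1) \sim P^1, \ldots, (\mathbf{X}^D, Y^D) \sim P^D$ that satisfy Assumption (A1). Then the estimator 
$f_{S^*}$ in (11) satisfies 

$$f_{S^*} \in \underset{f \in C^0}{\arg\min} \; \sup_{P^T \in \mathcal{P}} \; \mathbb{E}_{(\mathbf{X}^T, Y^T) \sim P^T} \left(Y^T - f(\mathbf{X}^T)\right)^2,$$

where $\mathcal{P}$ contains all distributions over $(\mathbf{X}^T, Y^T)$ that are absolutely continuous with respect to the same product 
measure $\mu$ and satisfy $Y^T \mid \mathbf{X}^T_{S^*} \overset{d}{=} Y^1 \mid \mathbf{X}^1_{S^*}$. 


Proof. Consider a function $f$ that is possibly different from $f_{S^*}$, see (11). For each distribution $Q \in \mathcal{P}$, we will now 
construct a distribution $P \in \mathcal{P}$ such that 

$$\int (y - f(\mathbf{x}))^2 \, dP \geq \int (y - f_{S^*}(\mathbf{x}))^2 \, dQ.$$

In this proof, we assume that the probability distributions in $\mathcal{P}$ are absolutely continuous with respect to Lebesgue 
measure. The extension to the case where they are absolutely continuous with respect to a same product measure 
$\mu$ is straightforward. Let us therefore assume that $Q$ has a density $(\mathbf{x}, y) \mapsto q(\mathbf{x}, y)$. Define $P$ to be the distribution 
that corresponds to $p(\mathbf{x}, y) := q(\mathbf{x}_{S^*}, y) \cdot q(\mathbf{x}_N)$, where $\mathbf{x}_N$ contains all components of $\mathbf{x}$ that are not in $S^*$. In the 
distribution $P$, the random vector $\mathbf{X}_N$ is independent of $(\mathbf{X}_{S^*}, Y)$. But then 

$$\begin{aligned}
\int (y - f(\mathbf{x}))^2 \, dP &= \int_{\mathbf{x}_N} \int_{\mathbf{x}_{S^*}, y} (y - f(\mathbf{x}_{S^*}, \mathbf{x}_N))^2 \, p(\mathbf{x}_{S^*}, y) \, d\mathbf{x}_{S^*} \, dy \; p(\mathbf{x}_N) \, d\mathbf{x}_N \\
&\geq \int_{\mathbf{x}_N} \int_{\mathbf{x}_{S^*}, y} (y - f_{S^*}(\mathbf{x}_{S^*}))^2 \, p(\mathbf{x}_{S^*}, y) \, d\mathbf{x}_{S^*} \, dy \; p(\mathbf{x}_N) \, d\mathbf{x}_N \\
&= \int (y - f_{S^*}(\mathbf{x}_{S^*}))^2 \, q(\mathbf{x}_{S^*}, \mathbf{x}_N, y) \, d\mathbf{x}_{S^*} \, dy \, d\mathbf{x}_N \\
&= \int (y - f_{S^*}(\mathbf{x}))^2 \, dQ.
\end{aligned}$$

□ 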


A.2 Proof of Proposition 

We consider three variables and the following generative process: Y'^ = a*X|. + ^fc, _ ^ky'k _|_ where 
~ ~ A/'(0, cr^) and (Xg.)j ~ A/'(0, ((Tx)j). In this model, 7 *^ is the parameter responsible for the 

difference between the tasks, while the other parameters are shared between the tasks. 

At training time, D tasks are available. We first aim to obtain an explicit formula for the linear regression 
coefficients $\beta^{CS} = (\beta_X^{CS}, \beta_Z^{CS})$ obtained from pooling all the training tasks together. Denote by X, Y and Z the 
pooled training data. For fixed $\gamma^1, \ldots, \gamma^D$, the expected loss in the training data for a coefficient $\beta$ satisfies: 

E((y-(/3x)‘X-/3z^)") = -^E(y''-(^x)‘x'=-/3zZ'') 

k=l 

= /3^diag(a|)/Ix + ^ [a^D + Vy^) + 2(/Iz J - l)a*diag(a|)/3x + Vy - 2^VyPz (12) 






By differentiating (12 1 with respect to /3, we obtain the following expression for the pooled coefficients: 




7 a 


+ -Do-2 _ ^Q,«diag(o-|) 


and 


/3?.® 


_ /-I 7 aCSs 

(1 ^ Pz )^5 


where 7 ^ = X]feLi(7*^)^ s-nd 7 = 7 *’- Consider now an unseen test task with coefficient 7 ^^. The expected loss 

on the test task using the pooled coefficients is: 


= E ((y^ - = (/iS®)‘diag(ai)/l5® + 

+ 2/ig7^a‘diag(ai)/jg® + W 

- 2 a‘diag(ag)/?g^ - 2 / 3 g®W 7 ^. 


Therefore, the expectation with respect to 7 ^ is: 


(13) 


(Ept(/ 3^®)) = (/3g®)*diag(a|)/lg^ + (/3g®)" + a^) + Cy - 2a‘diag(ag)/3g® 


Denote by EpT(/3'®) = the expected loss when using the invariant conditional predictor /J® = (a,0). Then: 

E^t >E^t (Ept(P*)) 

^ (/3g®)*diag(al:)(/ig®) + (/ig®)^ (CyE^ + a^) +Vy - 2a‘diag(ag)/3g® > 

{Pz^f (CyS^ + al) > 2 a*diag(o-x)^g® - (^g®)*diag(crx)^g® - a‘diag(o-x)a 

iPz^)^ (VyS^ + cr^) > -^(/3g®)^Q!‘diag(o-|:)a, (14) 

by replacing /3g® = a —a^/3g®. This inequality holds true for any value of the variance E^, and the pooled coefficient 
leads to larger error in expectation. 

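Consider now the case where the coefficients $\gamma^k$ are fixed and centered around a non-zero value $\mu$. Then the expectation with respect to $\gamma^T$ of the loss in the test task is the following:

$$E_{\gamma^T}\big(\mathcal{E}_{P^T}(\beta^{CS})\big) = (\beta_X^{CS})^t \mathrm{diag}(\sigma_X^2) \beta_X^{CS} + (\beta_Z^{CS})^2 \big(v_Y (\Sigma^2 + \mu^2) + \sigma^2\big)$$

$$+ 2 \beta_Z^{CS} \alpha^t \mathrm{diag}(\sigma_X^2) \beta_X^{CS} \mu + v_Y - 2 \alpha^t \mathrm{diag}(\sigma_X^2) \beta_X^{CS} - 2 \beta_Z^{CS} v_Y \mu. \tag{15}$$

Then, if $\bar{\gamma} \neq 0$ (if $\bar{\gamma} = 0$, both estimators coincide):

$$E_{\gamma^T}\big(\mathcal{E}_{P^T}(\beta^{CS})\big) > E_{\gamma^T}\big(\mathcal{E}_{P^T}(\beta^{CS(S^*)})\big) \Leftrightarrow \Sigma^2 > P(\mu),$$

where

$$P(\mu) = -\mu^2 + \frac{2\mu}{\beta_Z^{CS}} \Big(1 - \Big(1 - \frac{\bar{\gamma}}{D} \beta_Z^{CS}\Big) \frac{\alpha^t \mathrm{diag}(\sigma_X^2) \alpha}{v_Y}\Big) - \frac{1}{v_Y} \Big(\frac{(\bar{\gamma})^2}{D^2} \alpha^t \mathrm{diag}(\sigma_X^2) \alpha + \sigma^2\Big). \tag{16}$$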
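To illustrate the threshold in (16), the following sketch evaluates the expected test loss (15) at and around $\Sigma^2 = P(\mu)$. All numerical values (a single predictor, two training tasks, $\mu = 2$) are hypothetical and chosen only so that the arithmetic is easy to follow; the pooled coefficients are computed as $\beta_Z^{CS} = \bar{\gamma}\sigma^2 / (\overline{\gamma^2} v_Y + D\sigma^2 - (\bar{\gamma})^2 \alpha^t \mathrm{diag}(\sigma_X^2) \alpha / D)$ and $\beta_X^{CS} = (1 - \bar{\gamma}\beta_Z^{CS}/D)\alpha$.

```python
import numpy as np

# Illustrative values only (assumptions, not taken from the paper).
alpha = np.array([1.0])        # single predictor in X
var_x = np.array([1.0])        # diag(sigma_X^2)
sigma2 = 1.0                   # noise variance
gammas = np.array([1.0, 3.0])  # training-task coefficients
mu = 2.0                       # mean of gamma^T
D = len(gammas)

Sx = np.diag(var_x)
a_quad = alpha @ Sx @ alpha    # alpha^t diag(sigma_X^2) alpha
v_y = a_quad + sigma2          # variance of Y
g_sum = gammas.sum()
g_sq = (gammas ** 2).sum()
c = g_sum / D

# Pooled coefficients (closed form).
beta_z = g_sum * sigma2 / (g_sq * v_y + D * sigma2 - g_sum ** 2 / D * a_quad)
beta_x = (1.0 - c * beta_z) * alpha

def expected_test_loss(Sigma2):
    """Equation (15): gamma^T has mean mu and variance Sigma2."""
    return (beta_x @ Sx @ beta_x
            + beta_z ** 2 * (v_y * (Sigma2 + mu ** 2) + sigma2)
            + 2 * beta_z * mu * (alpha @ Sx @ beta_x)
            + v_y
            - 2 * alpha @ Sx @ beta_x
            - 2 * beta_z * v_y * mu)

# Threshold P(mu) from equation (16).
P_mu = (-mu ** 2
        + 2 * mu / beta_z * (1 - (1 - c * beta_z) * a_quad / v_y)
        - (c ** 2 * a_quad + sigma2) / v_y)

loss_at_threshold = expected_test_loss(P_mu)   # equals sigma2
loss_above = expected_test_loss(P_mu + 1.0)    # pooled is worse
loss_below = expected_test_loss(P_mu - 1.0)    # pooled is better
```

With these values $P(\mu) = 4.5$: when the tasks are less diverse than this threshold, pooling gives a smaller expected loss than the invariant predictor, and a larger one otherwise.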


A.3 Proof of Proposition 

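Proof. For $k \in \{1, \ldots, D, T\}$, let $Q^k$ be the probability distribution with density:

$$q^k(x_{S^*}, x_N, y) := p^k(x_{S^*}, y) \, p^T(x_N \mid x_{S^*}, y). \tag{17}$$

In the test task $T$, we trivially have $q^T = p^T$. First, it is easy to see that $q^k$ and $p^k$ have the same marginal distribution over $x_{S^*}$ and $y$. Indeed:

$$q^k(x_{S^*}, y) = \int_{\mathbb{R}^{|N|}} q^k(x_{S^*}, x_N, y) \, dx_N = \int_{\mathbb{R}^{|N|}} p^k(x_{S^*}, y) \, p^T(x_N \mid x_{S^*}, y) \, dx_N$$

$$= p^k(x_{S^*}, y) \int_{\mathbb{R}^{|N|}} p^T(x_N \mid x_{S^*}, y) \, dx_N = p^k(x_{S^*}, y). \tag{18}$$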
Second, we prove that the conditional $q^k(y \mid x_{S^*}, x_N)$ is the same in all tasks. Indeed, by applying Bayes' rule:

$$q^k(y \mid x_{S^*}, x_N) = \frac{q^k(x_N \mid y, x_{S^*}) \, q^k(y, x_{S^*})}{q^k(x_{S^*}, x_N)} = \frac{p^T(x_N \mid y, x_{S^*}) \, q^k(y \mid x_{S^*})}{q^k(x_N \mid x_{S^*})}$$

$$= \frac{p^T(x_N \mid y, x_{S^*}) \, p^k(y \mid x_{S^*})}{\int_{\mathbb{R}} q^k(y, x_N \mid x_{S^*}) \, dy} = \frac{p^T(x_N \mid y, x_{S^*}) \, p^k(y \mid x_{S^*})}{\int_{\mathbb{R}} q^k(x_N \mid y, x_{S^*}) \, q^k(y \mid x_{S^*}) \, dy}$$

$$= \frac{p^T(x_N \mid y, x_{S^*}) \, p^k(y \mid x_{S^*})}{\int_{\mathbb{R}} p^T(x_N \mid y, x_{S^*}) \, p^k(y \mid x_{S^*}) \, dy}.$$

We have used the fact that $q^k(x_N \mid y, x_{S^*}) = p^T(x_N \mid y, x_{S^*})$, which follows from (18). Since the last equality leads to a term which is equal in all tasks (indeed, Assumption (A1) ensures that $p^k(y \mid x_{S^*})$ is the same for all $k \in \{1, \ldots, D, T\}$), we have the desired result. □
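The two steps of the proof can be illustrated on a small discrete example. The sketch below is only an illustration under stated assumptions: all variables are binary, the probability tables are made up, and integrals are replaced by sums. It builds $q^k$ as in (17) for a training and a test task that share $p(y \mid x_{S^*})$ but differ in the marginal of $x_{S^*}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Binary x_S, x_N, y. Joint arrays use the axis order (x_S, x_N, y).
# p(y | x_S) is shared across tasks, as required by Assumption (A1).
p_y_given_s = np.array([[0.7, 0.3],   # p(y | x_S = 0)
                        [0.2, 0.8]])  # p(y | x_S = 1)

# Task-specific marginals of x_S (hypothetical values).
p_s = {"train": np.array([0.9, 0.1]),
       "test": np.array([0.4, 0.6])}

# Test-task conditional p^T(x_N | x_S, y), indexed as [x_S, y, x_N].
pT_n = rng.uniform(0.1, 1.0, size=(2, 2, 2))
pT_n /= pT_n.sum(axis=2, keepdims=True)

def p_joint_sy(task):
    """p^k(x_S, y), shape (x_S, y)."""
    return p_s[task][:, None] * p_y_given_s

def q_joint(task):
    """q^k(x_S, x_N, y) = p^k(x_S, y) * p^T(x_N | x_S, y), equation (17)."""
    return p_joint_sy(task)[:, None, :] * pT_n.transpose(0, 2, 1)

def cond_y(q):
    """q^k(y | x_S, x_N), obtained by normalising over the y axis."""
    return q / q.sum(axis=2, keepdims=True)

q_train = q_joint("train")
q_test = q_joint("test")

# Step 1: marginalising x_N out of q^k recovers p^k(x_S, y), equation (18).
marginal_train = q_train.sum(axis=1)

# Step 2: the conditional of y given (x_S, x_N) does not depend on the task.
cond_train = cond_y(q_train)
cond_test = cond_y(q_test)
```

Here `marginal_train` coincides with `p_joint_sy("train")`, and `cond_train` coincides with `cond_test` even though the two tasks have different marginals over $x_{S^*}$.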


A.4 Statement and proof of Proposition 

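In this section, we provide an analytic expression for $\beta^{opt}$ in terms of $\alpha$ and $\epsilon$.

Proposition 3 Assume that $X_{S^*}$ follows an arbitrary distribution and that Assumptions (A1) and (A2) hold. Let $\gamma \in \mathbb{R}^{|N|}$ be the solution of an $L^2$ regression of $X_N^T$ on $Y^T$. Therefore, we can write $X_N^T = \gamma Y^T + \rho$, with $\rho$ uncorrelated to $Y^T$, and the components of $\rho$ can be correlated. Then the regression coefficients $\beta^{opt} = (\beta^{opt}_{S^*}, \beta^{opt}_N)$ minimizing the expected squared loss in the test task satisfy

$$\beta^{opt}_N = E(\epsilon^2) M^{-1} \gamma, \tag{19}$$

$$\beta^{opt}_{S^*} = \alpha \big(1 - \gamma^t \beta^{opt}_N\big) - \Sigma_{X,S^*}^{-1} \Sigma_{X,N} \beta^{opt}_N, \tag{20}$$

where $M = E(\epsilon^2) \gamma \gamma^t + \Sigma_N - \Sigma_{X,N}^t \Sigma_{X,S^*}^{-1} \Sigma_{X,N}$, and $\Sigma_N := E(\rho \rho^t)$, $\Sigma_{X,S^*} := E(X_{S^*} X_{S^*}^t)$, $\Sigma_{X,N} := E(X_{S^*} \rho^t)$ are the corresponding Gram matrices.†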

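Proof. To simplify notation, we write $Y^T$, $X_{S^*}^T$ and $X_N^T$ as $Y$, $X_{S^*}$ and $X_N$. We compute the gradients of the expected squared loss after replacing the expressions for $Y$ and $X_N$:

$$L = E\big(Y - \beta_{S^*}^t X_{S^*} - \beta_N^t X_N\big)^2$$

$$= \big(\alpha(1 - \gamma^t \beta_N) - \beta_{S^*}\big)^t \Sigma_{X,S^*} \big(\alpha(1 - \gamma^t \beta_N) - \beta_{S^*}\big)$$

$$+ (1 - \beta_N^t \gamma)^2 E(\epsilon^2) + \beta_N^t \Sigma_N \beta_N - 2 \big(\alpha(1 - \gamma^t \beta_N) - \beta_{S^*}\big)^t \Sigma_{X,N} \beta_N.$$

The gradients satisfy

$$\frac{\partial L}{\partial \beta_{S^*}} = -2 \Sigma_{X,S^*} \big(\alpha(1 - \gamma^t \beta_N) - \beta_{S^*}\big) + 2 \Sigma_{X,N} \beta_N,$$

$$\frac{1}{2} \frac{\partial L}{\partial \beta_N} = \Sigma_N \beta_N - (1 - \gamma^t \beta_N) E(\epsilon^2) \gamma + \gamma \alpha^t \Sigma_{X,N} \beta_N - \gamma \alpha^t \Sigma_{X,S^*} \big(\alpha(1 - \gamma^t \beta_N) - \beta_{S^*}\big) - \Sigma_{X,N}^t \big(\alpha(1 - \gamma^t \beta_N) - \beta_{S^*}\big).$$

By setting these to zero, we find the stated values for $\beta^{opt}_{S^*}$ and $\beta^{opt}_N$: the first equation gives $\Sigma_{X,S^*} \big(\alpha(1 - \gamma^t \beta_N) - \beta_{S^*}\big) = \Sigma_{X,N} \beta_N$, which is (20), and substituting it into the second equation gives $M \beta_N = E(\epsilon^2) \gamma$, which is (19).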


□ 
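Equations (19) and (20) can be checked against a direct least-squares solution computed from population second moments. This is a minimal sketch under explicit assumptions: the numerical values are hypothetical, $\epsilon$ is taken to be uncorrelated with both $X_{S^*}$ and $\rho$ (as in the loss expression used in the proof), and $\Sigma_{X,N}$ is chosen so that $\Sigma_{X,N}^t \alpha = 0$, which makes $\rho$ uncorrelated with $Y$.

```python
import numpy as np

# Illustrative values only (assumptions, not taken from the paper).
alpha = np.array([1.0, -2.0])   # Y = alpha^t X_S + eps
gamma = np.array([0.5, 1.5])    # X_N = gamma * Y + rho
e2 = 0.7                        # E(eps^2)

S_xs = np.array([[2.0, 0.3],
                 [0.3, 1.0]])   # E(X_S X_S^t)
S_xn = np.array([[0.2, -0.1],
                 [0.1, -0.05]]) # E(X_S rho^t); columns are orthogonal to alpha
S_n = np.array([[1.5, 0.2],
                [0.2, 1.0]])    # E(rho rho^t)

# Closed-form coefficients, equations (19) and (20).
M = e2 * np.outer(gamma, gamma) + S_n - S_xn.T @ np.linalg.solve(S_xs, S_xn)
beta_n = e2 * np.linalg.solve(M, gamma)
beta_s = alpha * (1.0 - gamma @ beta_n) - np.linalg.solve(S_xs, S_xn @ beta_n)

# Population second moments of (X_S, X_N, Y) implied by the model.
v_y = alpha @ S_xs @ alpha + e2               # E(Y^2)
c_yrho = S_xn.T @ alpha                       # E(rho Y), zero by construction
c_sn = np.outer(S_xs @ alpha, gamma) + S_xn   # E(X_S X_N^t)
c_nn = (v_y * np.outer(gamma, gamma) + np.outer(gamma, c_yrho)
        + np.outer(c_yrho, gamma) + S_n)      # E(X_N X_N^t)
gram = np.block([[S_xs, c_sn],
                 [c_sn.T, c_nn]])
rhs = np.concatenate([S_xs @ alpha, gamma * v_y + c_yrho])  # E(X_S Y), E(X_N Y)

# Least-squares solution of the normal equations for comparison.
beta_ols = np.linalg.solve(gram, rhs)
```

The first two entries of `beta_ols` agree with `beta_s` and the last two with `beta_n`, so the closed form and the normal equations give the same minimizer for this example.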


† We dropped the superscript $T$ to lighten the notation.